Open In Colab

Mounting Google Drive

In [64]:
from google.colab import drive
drive.mount('/content/drive/')
Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
In [65]:
!pip3 install ftfy
Requirement already satisfied: ftfy in /usr/local/lib/python3.6/dist-packages (5.8)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.6/dist-packages (from ftfy) (0.2.5)

Importing Libraries

In [66]:
# Using TensorFlow 1.x only in colab as found a issue with 2.3 version used by colab while working with DNN model fit. Did not observe any issue with Tensor flow 2.1 version on local jupyter enviornment.
%tensorflow_version 1.x
In [67]:
import pandas as pd 
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
import time, os, sys, itertools, re 
from PIL import Image
import warnings, pickle, string
from dateutil import parser
%matplotlib inline

# Data Visualization
import cufflinks as cf
import plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot

from ftfy import fix_text, badness

# Traditional Modeling
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Sequential Modeling
import keras.backend as K
from keras.models import Sequential, Model
from keras.layers.merge import Concatenate
from keras.layers import Input, Dropout, Flatten, Dense, Embedding, LSTM, GRU
from keras.layers import BatchNormalization, TimeDistributed, Conv1D, MaxPooling1D
from keras.constraints import max_norm, unit_norm
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import EarlyStopping, ModelCheckpoint

# Tools & Evaluation metrics
from sklearn.metrics import confusion_matrix, classification_report, auc
from sklearn.metrics import roc_curve, accuracy_score, precision_recall_curve
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

Reading the data from excel

In [68]:
data=pd.read_excel('/content/drive/MyDrive/Capstone/input_data.xlsx')
#data=pd.read_excel('input_data.xlsx')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8500 entries, 0 to 8499
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Short description  8492 non-null   object
 1   Description        8499 non-null   object
 2   Caller             8500 non-null   object
 3   Assignment group   8500 non-null   object
dtypes: object(4)
memory usage: 265.8+ KB

Exploratory Data Analysis

Univariate visualization

Single-variable or univariate visualization is the simplest type of visualization which consists of observations on only a single characteristic or attribute. Univariate visualization includes histogram, bar plots and line charts.

The distribution of Assignment groups

Plots how the assignments groups are scattered across the dataset. The bar chart, histogram and pie chart tells the frequency of any ticket assigned to any group OR the tickets count for each group.

In [69]:
data.head()
Out[69]:
Short description Description Caller Assignment group
0 login issue -verified user details.(employee# & manager na... spxjnwir pjlcoqds GRP_0
1 outlook \r\n\r\nreceived from: hmjdrvpb.komuaywn@gmail... hmjdrvpb komuaywn GRP_0
2 cant log in to vpn \r\n\r\nreceived from: eylqgodm.ybqkwiam@gmail... eylqgodm ybqkwiam GRP_0
3 unable to access hr_tool page unable to access hr_tool page xbkucsvz gcpydteq GRP_0
4 skype error skype error owlgqjme qhcozdfx GRP_0
In [70]:
assignment_group_count=data['Assignment group'].value_counts()
assignment_group_count.describe()
Out[70]:
count      74.000000
mean      114.864865
std       465.747516
min         1.000000
25%         5.250000
50%        26.000000
75%        84.000000
max      3976.000000
Name: Assignment group, dtype: float64
In [71]:
plt.subplots(figsize=(50,10))
ax=sns.countplot(x='Assignment group', data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=30)
plt.tight_layout
plt.show()
In [72]:
assignment_group_count.head(50)
Out[72]:
GRP_0     3976
GRP_8      661
GRP_24     289
GRP_12     257
GRP_9      252
GRP_2      241
GRP_19     215
GRP_3      200
GRP_6      184
GRP_13     145
GRP_10     140
GRP_5      129
GRP_14     118
GRP_25     116
GRP_33     107
GRP_4      100
GRP_29      97
GRP_18      88
GRP_16      85
GRP_17      81
GRP_31      69
GRP_7       68
GRP_34      62
GRP_26      56
GRP_40      45
GRP_28      44
GRP_41      40
GRP_15      39
GRP_30      39
GRP_42      37
GRP_20      36
GRP_45      35
GRP_22      31
GRP_1       31
GRP_11      30
GRP_21      29
GRP_47      27
GRP_23      25
GRP_48      25
GRP_62      25
GRP_60      20
GRP_39      19
GRP_27      18
GRP_37      16
GRP_36      15
GRP_44      15
GRP_50      14
GRP_65      11
GRP_53      11
GRP_52       9
Name: Assignment group, dtype: int64
In [73]:
assignment_group_count.tail(24)
Out[73]:
GRP_55    8
GRP_51    8
GRP_46    6
GRP_59    6
GRP_49    6
GRP_43    5
GRP_32    4
GRP_66    4
GRP_38    3
GRP_63    3
GRP_58    3
GRP_68    3
GRP_56    3
GRP_57    2
GRP_54    2
GRP_69    2
GRP_72    2
GRP_71    2
GRP_64    1
GRP_61    1
GRP_70    1
GRP_73    1
GRP_67    1
GRP_35    1
Name: Assignment group, dtype: int64

Check Missing Values in dataframe

In [74]:
data.isnull().sum()
Out[74]:
Short description    8
Description          1
Caller               0
Assignment group     0
dtype: int64
In [75]:
data[data["Short description"].isnull()]
Out[75]:
Short description Description Caller Assignment group
2604 NaN \r\n\r\nreceived from: ohdrnswl.rezuibdt@gmail... ohdrnswl rezuibdt GRP_34
3383 NaN \r\n-connected to the user system using teamvi... qftpazns fxpnytmk GRP_0
3906 NaN -user unable tologin to vpn.\r\n-connected to... awpcmsey ctdiuqwe GRP_0
3910 NaN -user unable tologin to vpn.\r\n-connected to... rhwsmefo tvphyura GRP_0
3915 NaN -user unable tologin to vpn.\r\n-connected to... hxripljo efzounig GRP_0
3921 NaN -user unable tologin to vpn.\r\n-connected to... cziadygo veiosxby GRP_0
3924 NaN name:wvqgbdhm fwchqjor\nlanguage:\nbrowser:mic... wvqgbdhm fwchqjor GRP_0
4341 NaN \r\n\r\nreceived from: eqmuniov.ehxkcbgj@gmail... eqmuniov ehxkcbgj GRP_0

Copy Short Description to Description if the Description value is NaN

In [76]:
data.Description.fillna(data["Short description"], inplace = True)
In [77]:
data[data["Description"].isnull()]
Out[77]:
Short description Description Caller Assignment group
In [78]:
data['Short description'] = data['Short description'].replace(np.nan, '', regex=True)
In [79]:
data.isnull().sum()
Out[79]:
Short description    0
Description          0
Caller               0
Assignment group     0
dtype: int64
In [80]:
init_notebook_mode()
cf.go_offline()

# Assignment group distribution
print('\033[1mTotal assignment groups:\033[0m', data['Assignment group'].nunique())

# Histogram
data['Assignment group'].iplot(
    kind='hist',
    xTitle='Assignment Group',
    yTitle='count',
    title='Assignment Group Distribution- Histogram (Fig-1)')

# Pie chart
assgn_grp = pd.DataFrame(data.groupby('Assignment group').size(),columns = ['Count']).reset_index()
assgn_grp.iplot(
    kind='pie', 
    labels='Assignment group', 
    values='Count', 
    title='Assignment Group Distribution- Pie Chart (Fig-2)', 
    hoverinfo="label+percent+name", hole=0.25)
Total assignment groups: 74

Lets visualize the percentage of incidents per assignment group

In [81]:
# Plot to visualize the percentage data distribution across different groups
sns.set(style="whitegrid")
plt.figure(figsize=(20,5))
ax = sns.countplot(x="Assignment group", data=data, order=data["Assignment group"].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
for p in ax.patches:
  ax.annotate(str(format(p.get_height()/len(data.index)*100, '.2f')+"%"), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'bottom', rotation=90, xytext = (0, 10), textcoords = 'offset points')

Top 20 and Bottom 20 assignment groups

In [82]:
top_20 = data['Assignment group'].value_counts().nlargest(20).reset_index()
In [83]:
plt.figure(figsize=(12,6))
bars = plt.bar(top_20['index'],top_20['Assignment group'])
plt.title('Top 20 Assignment groups with highest number of Tickets')
plt.xlabel('Assignment Group')
plt.xticks(rotation=90)
plt.ylabel('Number of Tickets')

for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x(), yval + .005, yval)
plt.tight_layout()
plt.show()
In [84]:
bottom_20 = data['Assignment group'].value_counts().nsmallest(20).reset_index()
In [85]:
plt.figure(figsize=(12,6))
bars = plt.bar(bottom_20['index'],bottom_20['Assignment group'])
plt.title('Bottom 20 Assignment groups with small number of Tickets')
plt.xlabel('Assignment Group')
plt.xticks(rotation=90)
plt.ylabel('Number of Tickets')
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x(), yval + .005, yval)
plt.tight_layout()
plt.show()

The distribution of Callers

Plots how the callers are associated with tickets and what are the assignment groups they most frequently raise tickets for.

In [86]:
# Find out top 10 callers in terms of frequency of raising tickets in the entire dataset
print('\033[1mTotal caller count:\033[0m', data['Caller'].nunique())
df = pd.DataFrame(data.groupby(['Caller']).size().nlargest(10), columns=['Count']).reset_index()
df.iplot(kind='pie',
         labels='Caller', 
         values='Count', 
         title='Top 10 caller- Pie Chart (Fig-7)',
         colorscale='-spectral',
         pull=[0,0,0,0,0.05,0.1,0.15,0.2,0.25,0.3])
Total caller count: 2950

Top 5 callers in each assignment group

In [87]:
top_n = 5
s = data['Caller'].groupby(data['Assignment group']).value_counts()
caller_grp = pd.DataFrame(s.groupby(level=0).nlargest(top_n).reset_index(level=0, drop=True))
caller_grp.head(15)
Out[87]:
Caller
Assignment group Caller
GRP_0 fumkcsji sarmtlhy 132
rbozivdq gmlhrtvp 86
olckhmvx pcqobjnd 54
efbwiadp dicafxhv 45
mfeyouli ndobtzpw 13
GRP_1 bpctwhsn kzqsbmtp 6
jloygrwh acvztedi 4
jyoqwxhz clhxsoqy 3
spxqmiry zpwgoqju 3
kbnfxpsy gehxzayq 2
GRP_10 bpctwhsn kzqsbmtp 60
ihfkwzjd erbxoyqk 6
dizquolf hlykecxa 5
gnasmtvx cwxtsvkm 3
hlrmufzx qcdzierm 3

The distribution of description lengths

Plots the variation of length and word count of new description attribute

In [88]:
data.insert(1, 'desc_len', data['Description'].astype(str).apply(len))
data.insert(5, 'desc_word_count', data['Description'].apply(lambda x: len(str(x).split())))
data.head()
Out[88]:
Short description desc_len Description Caller Assignment group desc_word_count
0 login issue 206 -verified user details.(employee# & manager na... spxjnwir pjlcoqds GRP_0 33
1 outlook 194 \r\n\r\nreceived from: hmjdrvpb.komuaywn@gmail... hmjdrvpb komuaywn GRP_0 25
2 cant log in to vpn 87 \r\n\r\nreceived from: eylqgodm.ybqkwiam@gmail... eylqgodm ybqkwiam GRP_0 11
3 unable to access hr_tool page 29 unable to access hr_tool page xbkucsvz gcpydteq GRP_0 5
4 skype error 12 skype error owlgqjme qhcozdfx GRP_0 2
In [89]:
# Description text length
data['desc_len'].iplot(
    kind='bar',
    xTitle='text length',
    yTitle='count',
    colorscale='-ylgn',
    title='Description Text Length Distribution (Fig-11)')

# Description word count
data['desc_word_count'].iplot(
    kind='bar',
    xTitle='word count',
    linecolor='black',
    yTitle='count',
    colorscale='-bupu',
    title='Description Word Count Distribution (Fig-12)')

Create a rule based engine

In [90]:
df_rules = pd.read_csv('/content/drive/MyDrive/Capstone/Rule_matrix.csv')
#df_rules = pd.read_csv("Rule_matrix.csv")
In [91]:
def applyRules(datadf,rulesdf,Description,ShortDescription):
    datadf['pred_group'] = np.nan
    for i, row in rulesdf.iterrows():                  
        for j, row in datadf.iterrows():
            if pd.notna(datadf[ShortDescription][j]):
                if (('erp' in datadf[ShortDescription][j]) and (('EU_tool' in datadf[ShortDescription][j]))):
                        datadf['pred_group'][j] = 'GRP_25'
        for j, row in datadf.iterrows():
            if pd.notna(datadf[Description][j]):
                if (datadf[Description][j] == 'the'):
                    datadf['pred_group'][j] = 'GRP_17' 
                
                if (('finance_app' in ((datadf[ShortDescription][j]) or datadf[Description][j])) and ('HostName_1132' not in datadf[ShortDescription][j])):
                    datadf['pred_group'][j] = 'GRP_55'
                
                if (('processor' in datadf[Description][j]) and ('engg' in datadf[Description][j])):
                    datadf['pred_group'][j] = 'GRP_58'
                
                                     
        if rulesdf['Short Desc Rule'][i] == 'begins with' and rulesdf['Desc Rule'][i] == 'begins with' and pd.isna(rulesdf['User'][i]):
            for j, row in datadf.iterrows():
                if pd.notna(datadf[ShortDescription][j]) and pd.notna(datadf[Description][j]):
                    if ((datadf[ShortDescription][j].startswith(rulesdf['Short Dec Keyword'][i])) and (datadf[Description][j].startswith(rulesdf['Dec keyword'][i]))):
                        datadf['pred_group'][j] = rulesdf['Group'][i]
                        
        if pd.isna(rulesdf['Short Desc Rule'][i]) and rulesdf['Desc Rule'][i] == 'begins with' and pd.notna(rulesdf['User'][i]):
            for j, row in datadf.iterrows():
                if pd.notna(datadf[Description][j]) and pd.notna(datadf['Caller'][j]):
                    if ((datadf[Description][j].startswith(rulesdf['Desc Rule'][i]) and (rulesdf['User'][i] == datadf['Caller'][j]))):
                        datadf['pred_group'][j] = rulesdf['Group'][i]
                        
        if rulesdf['Short Desc Rule'][i] == 'contains' and pd.notna(rulesdf['User'][i]):
            for j, row in datadf.iterrows():
                if (pd.notna(datadf[ShortDescription][j]) and pd.notna(datadf['Caller'][j])):
                     if ((rulesdf['Short Dec Keyword'][i] in datadf[ShortDescription][j]) and (rulesdf['User'][i] == datadf['Caller'][j])):
                        datadf['pred_group'][j] = rulesdf['Group'][i]
        if rulesdf['Short Desc Rule'][i] == 'contains' and pd.isna(rulesdf['Desc Rule'][i]) and pd.isna(rulesdf['User'][i]):
            for j, row in datadf.iterrows():
                if pd.notna(datadf[ShortDescription][j]):
                    if (rulesdf['Short Dec Keyword'][i] in datadf[ShortDescription][j]):
                        datadf['pred_group'][j] = rulesdf['Group'][i]
        if pd.isna(rulesdf['Short Desc Rule'][i]) and rulesdf['Desc Rule'][i] == 'begins with' and pd.isna(rulesdf['User'][i]):
            for j, row in datadf.iterrows():
                if pd.notna(datadf[Description][j]):
                    if (datadf[Description][j].startswith(rulesdf['Dec keyword'][i])):
                        datadf['pred_group'][j] = rulesdf['Group'][i]
        if pd.isna(rulesdf['Short Desc Rule'][i]) and rulesdf['Desc Rule'][i] == 'contains' and pd.isna(rulesdf['User'][i]):
            for j, row in datadf.iterrows():
                if pd.notna(datadf[Description][j]):
                    if (rulesdf['Dec keyword'][i] in datadf[Description][j]):
                        datadf['pred_group'][j] = rulesdf['Group'][i]
        if pd.isna(rulesdf['Short Desc Rule'][i]) and rulesdf['Desc Rule'][i] == 'not contain' and pd.isna(rulesdf['User'][i]):
            for j, row in datadf.iterrows():
                if pd.notna(datadf[Description][j]):
                    if (rulesdf['Dec keyword'][i] in datadf[Description][j]):
                        datadf['pred_group'][j] = rulesdf['Group'][i]


        if rulesdf['Short Desc Rule'][i] == 'not contain' and pd.isna(rulesdf['Desc Rule'][i]) and pd.isna(rulesdf['User'][i]):
            for j, row in datadf.iterrows():

                if pd.notna(datadf[ShortDescription][j]):
                    if (rulesdf['Short Dec Keyword'][i] in datadf[ShortDescription][j]):
                        datadf['pred_group'][j] = rulesdf['Group'][i]
        if pd.isna(rulesdf['Short Desc Rule'][i]) and rulesdf['Desc Rule'][i] == 'not contain' and pd.isna(rulesdf['User'][i]):
            for j, row in datadf.iterrows():
                if pd.notna(datadf[Description][j]):
                    if (datadf[Description][j].startswith(rulesdf['Dec keyword'][i])):
                        datadf['pred_group'][j] = rulesdf['Group'][i]
        if pd.isna(rulesdf['Short Desc Rule'][i]) and rulesdf['Desc Rule'][i] == 'contains' and pd.isna(rulesdf['User'][i]):
            for j, row in datadf.iterrows():
                if pd.notna(datadf[Description][j]):
                    if (rulesdf['Dec keyword'][i] in datadf[Description][j]):
                        datadf['pred_group'][j] = rulesdf['Group'][i]

    return datadf
In [92]:
rules_applied_df = applyRules(data,df_rules,'Description','Short description')
rules_applied_df
Out[92]:
Short description desc_len Description Caller Assignment group desc_word_count pred_group
0 login issue 206 -verified user details.(employee# & manager na... spxjnwir pjlcoqds GRP_0 33 NaN
1 outlook 194 \r\n\r\nreceived from: hmjdrvpb.komuaywn@gmail... hmjdrvpb komuaywn GRP_0 25 NaN
2 cant log in to vpn 87 \r\n\r\nreceived from: eylqgodm.ybqkwiam@gmail... eylqgodm ybqkwiam GRP_0 11 NaN
3 unable to access hr_tool page 29 unable to access hr_tool page xbkucsvz gcpydteq GRP_0 5 NaN
4 skype error 12 skype error owlgqjme qhcozdfx GRP_0 2 NaN
... ... ... ... ... ... ... ...
8495 emails not coming in from zz mail 141 \r\n\r\nreceived from: avglmrts.vhqmtiua@gmail... avglmrts vhqmtiua GRP_29 19 NaN
8496 telephony_software issue 24 telephony_software issue rbozivdq gmlhrtvp GRP_0 2 NaN
8497 vip2: windows password reset for tifpdchb pedx... 50 vip2: windows password reset for tifpdchb pedx... oybwdsgx oxyhwrfz GRP_0 7 NaN
8498 machine não está funcionando 103 i am unable to access the machine utilities to... ufawcgob aowhxjky GRP_62 17 NaN
8499 an mehreren pc`s lassen sich verschiedene prgr... 82 an mehreren pc`s lassen sich verschiedene prgr... kqvbrspl jyzoklfx GRP_49 11 NaN

8500 rows × 7 columns

In [93]:
rules_applied_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8500 entries, 0 to 8499
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Short description  8500 non-null   object
 1   desc_len           8500 non-null   int64 
 2   Description        8500 non-null   object
 3   Caller             8500 non-null   object
 4   Assignment group   8500 non-null   object
 5   desc_word_count    8500 non-null   int64 
 6   pred_group         301 non-null    object
dtypes: int64(2), object(5)
memory usage: 465.0+ KB
In [94]:
rules_applied_df = rules_applied_df[(rules_applied_df['pred_group'].isna())]
rules_applied_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8199 entries, 0 to 8499
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Short description  8199 non-null   object
 1   desc_len           8199 non-null   int64 
 2   Description        8199 non-null   object
 3   Caller             8199 non-null   object
 4   Assignment group   8199 non-null   object
 5   desc_word_count    8199 non-null   int64 
 6   pred_group         0 non-null      object
dtypes: int64(2), object(5)
memory usage: 512.4+ KB
In [95]:
assignment_group_count=rules_applied_df['Assignment group'].value_counts()
assignment_group_count.describe()
Out[95]:
count      62.000000
mean      132.241935
std       488.873469
min         1.000000
25%        12.250000
50%        33.000000
75%        99.250000
max      3833.000000
Name: Assignment group, dtype: float64

Concatenate Short Description and Description Column into New Description, drop the previous columns

In [96]:
#Concatenate Short Description and Description columns
rules_applied_df['New Description'] = rules_applied_df['Description'] + ' ' +rules_applied_df['Short description']

clean_data=rules_applied_df.drop(['Short description', 'Description', 'pred_group', 'desc_len', 'desc_word_count'], axis=1)
In [97]:
clean_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8199 entries, 0 to 8499
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Caller            8199 non-null   object
 1   Assignment group  8199 non-null   object
 2   New Description   8199 non-null   object
dtypes: object(3)
memory usage: 256.2+ KB

Fixing Garbled Text/ Mojibake using ftfy library

In [98]:
# Write a function to apply to the dataset to detect Mojibakes
def is_mojibake_impacted(text):
    if not badness.sequence_weirdness(text):
        # nothing weird, should be okay
        return True
    try:
        text.encode('sloppy-windows-1252')
    except UnicodeEncodeError:
        # Not CP-1252 encodable, probably fine
        return True
    else:
        # Encodable as CP-1252, Mojibake alert level high
        return False
# Check the dataset for mojibake impact
clean_data[~clean_data.iloc[:,:].applymap(is_mojibake_impacted).all(1)]
Out[98]:
Caller Assignment group New Description
99 ecprjbod litmjwsy GRP_0 \n\nreceived from: ecprjbod.litmjwsy@gmail.com...
116 bgqpotek cuxakvml GRP_0 \r\n\r\nreceived from: bgqpotek.cuxakvml@gmail...
124 tvcdfqgp nrbcqwgj GRP_0 from: tvcdfqgp nrbcqwgj \nsent: friday, octobe...
164 tycludks cjofwigv GRP_0 \n\nreceived from: abcdri@company.com\n\nwindy...
170 fbvpcytz nokypgvx GRP_18 \n\nreceived from: fbvpcytz.nokypgvx@gmail.com...
... ... ... ...
8470 azxhejvq fyemlavd GRP_16 from: mikhghytr wafglhdrhjop \nsent: thursday,...
8471 xqyjztnm onfusvlz GRP_30 to 小贺,早上电脑开机开不出来 电...
8480 nlearzwi ukdzstwi GRP_9 \r\n\r\nreceived from: nlearzwi.ukdzstwi@gmail...
8498 ufawcgob aowhxjky GRP_62 i am unable to access the machine utilities to...
8499 kqvbrspl jyzoklfx GRP_49 an mehreren pc`s lassen sich verschiedene prgr...

820 rows × 3 columns

In [99]:
# Take an example of row# 8471 Short Desc and fix it
print('Grabled text: \033[1m%s\033[0m\nFixed text: \033[1m%s\033[0m' % (clean_data['New Description'][8471], 
                                                                        fix_text(clean_data['New Description'][8471])))

# List all mojibakes defined in ftfy library
print('\nMojibake Symbol RegEx:\n', badness.MOJIBAKE_SYMBOL_RE.pattern)
Grabled text: to 小贺,早上电脑开机开不出来 电脑开机开不出来
Fixed text: to 小贺,早上电脑开机开不出来 电脑开机开不出来

Mojibake Symbol RegEx:
 [ÂÃÎÏÐÑØÙĂĎĐŃŘŮ][€-Ÿ€ƒ‚„†‡ˆ‰‹Œ“•˜œŸ¡¢£¤¥¦§¨ª«¬¯°±²³µ¶·¸¹º¼½¾¿ˇ˘˝]|[ÂÃÎÏÐÑØÙĂĎĐŃŘŮ][›»‘”´©™]\w|×[€-Ÿƒ‚„†‡ˆ‰‹Œ“•˜œŸ¡¦§¨ª«¬¯°²³ˇ˘›‘”´©™]|[¬√][ÄÅÇÉÑÖÜáàâäãåçéèêëíìîïñúùûü†¢£§¶ß®©™≠ÆØ¥ªæø≤≥]|\w√[±∂]\w|◊|[ðđ][ŸŸ]|â€|вЂ[љћ¦°№™ќ“”]
In [100]:
# Sanitize the dataset from Mojibakes
clean_data['New Description'] = clean_data['New Description'].apply(fix_text)

# Visualize that row# 8471
clean_data.loc[8471]
Out[100]:
Caller                      xqyjztnm onfusvlz
Assignment group                       GRP_30
New Description     to 小贺,早上电脑开机开不出来 电脑开机开不出来
Name: 8471, dtype: object

Cleaning & Processing the data

In [101]:
def date_validity(date_str):
    try:
        parser.parse(date_str)
        return True
    except:
        return False
In [102]:
def process(text_string):
    text=text_string.lower()
    text_string = ' '.join([w for w in text_string.split() if not date_validity(w)])
    text_string = re.sub(r"received from:",'',text_string)
    text_string = re.sub(r"from:",' ',text_string)
    text_string = re.sub(r"to:",' ',text_string)
    text_string = re.sub(r"subject:",' ',text_string)
    text_string = re.sub(r"sent:",' ',text_string)
    text_string = re.sub(r"ic:",' ',text_string)
    text_string = re.sub(r"cc:",' ',text_string)
    text_string = re.sub(r"bcc:",' ',text_string)
    text_string = re.sub(r'\S*@\S*\s?', '', text_string)
    text_string = re.sub(r'\d+','' ,text_string)
    text_string = re.sub(r'\n',' ',text_string)
    text_string = re.sub(r'#','', text_string)
    text_string = re.sub(r'&;?', 'and',text_string)
    text_string = re.sub(r'\&\w*;', '', text_string)
    text_string = re.sub(r'https?:\/\/.*\/\w*', '', text_string)  
    #text_string= ''.join(c for c in text_string if c <= '\uFFFF') 
    text_string = text_string.strip()
    #text_string = ' '.join(re.sub("[^\u0030-\u0039\u0041-\u005a\u0061-\u007a]", " ", text_string).split())
    text_string = re.sub(r"\s+[a-zA-Z]\s+", ' ', text_string)
    text_string = re.sub(' +', ' ', text_string)
    text_string = text_string.replace(r'\b\w\b','').replace(r'\s+', ' ')
    text_string = text_string.strip()
    return text_string
In [103]:
clean_data["Clean_Description"] = clean_data["New Description"].apply(process)
In [104]:
clean_data
Out[104]:
Caller Assignment group New Description Clean_Description
0 spxjnwir pjlcoqds GRP_0 -verified user details.(employee# & manager na... -verified user details.(employee and manager n...
1 hmjdrvpb komuaywn GRP_0 \n\nreceived from: hmjdrvpb.komuaywn@gmail.com... hello team, my meetings/skype meetings etc are...
2 eylqgodm ybqkwiam GRP_0 \n\nreceived from: eylqgodm.ybqkwiam@gmail.com... hi cannot log on to vpn best cant log in to vpn
3 xbkucsvz gcpydteq GRP_0 unable to access hr_tool page unable to access... unable to access hr_tool page unable to access...
4 owlgqjme qhcozdfx GRP_0 skype error skype error skype error skype error
... ... ... ... ...
8495 avglmrts vhqmtiua GRP_29 \n\nreceived from: avglmrts.vhqmtiua@gmail.com... good afternoon, am not receiving the emails th...
8496 rbozivdq gmlhrtvp GRP_0 telephony_software issue telephony_software issue telephony_software issue telephony_software issue
8497 oybwdsgx oxyhwrfz GRP_0 vip2: windows password reset for tifpdchb pedx... vip: windows password reset for tifpdchb pedxr...
8498 ufawcgob aowhxjky GRP_62 i am unable to access the machine utilities to... i am unable to access the machine utilities to...
8499 kqvbrspl jyzoklfx GRP_49 an mehreren pc`s lassen sich verschiedene prgr... an mehreren pc`s lassen sich verschiedene prgr...

8199 rows × 4 columns

Language Translation

Load the consolidated final translated pickle file which contains the language translations. The Process used for language translation is commented below

In [105]:
with open('/content/drive/MyDrive/Capstone/Final_Translated_combined.pkl','rb') as f:
#with open('Final_Translated_combined.pkl','rb') as f:
    clean_data = pickle.load(f)
In [106]:
clean_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8466 entries, 0 to 48
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Caller             8466 non-null   object
 1   Assignment group   8466 non-null   object
 2   New Description    8466 non-null   object
 3   Clean_Description  8466 non-null   object
 4   language           8466 non-null   object
 5   Translated Text    8466 non-null   object
dtypes: object(6)
memory usage: 463.0+ KB
In [107]:
clean_data.tail()
Out[107]:
Caller Assignment group New Description Clean_Description language Translated Text
44 wgmqlnzh vpebwoat GRP_30 早上开机后显示器不出图像。 显示器不亮 早上开机后显示器不出图像。 显示器不亮 zh-cn The display does not appear in the morning. Di...
45 rtjwbuev gfpwdetq GRP_31 prtSID_737--文件无法打印到打印机,提示打印机错误。 文件无法打印到打印机,提示打... prtSID_--文件无法打印到打印机,提示打印机错误。 文件无法打印到打印机,提示打印机错误。 zh-cn The prtsid _- file cannot be printed to the pr...
46 fupikdoa gjkytoeh GRP_48 客户提供的在线送货单生成系统打不开,需尽快解决 客户提供的在线系统打不开 客户提供的在线送货单生成系统打不开,需尽快解决 客户提供的在线系统打不开 zh-cn The online delivery unit provided by the custo...
47 kyagjxdh dmtjpbnz GRP_30 进行采购时显示"找不到员工1111154833的数据,请通知系统管理员" erp无法进行采购... 进行采购时显示"找不到员工的数据,请通知系统管理员" erp无法进行采购(转给贺正平) zh-cn Show "Data from the employee, please notify th...
48 xqyjztnm onfusvlz GRP_30 to 小贺,早上电脑开机开不出来 电脑开机开不出来 to 小贺,早上电脑开机开不出来 电脑开机开不出来 zh-cn To small congratulations, the computer does no...
In [108]:
assignment_group_cnt=clean_data['Assignment group'].value_counts()
assignment_group_cnt.describe()
Out[108]:
count      43.000000
mean      196.883721
std       596.778064
min        16.000000
25%        31.000000
50%        68.000000
75%       145.500000
max      3941.000000
Name: Assignment group, dtype: float64

Data Augmentation

In [109]:
!pip3 install nltk
import nltk 
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import wordnet
Requirement already satisfied: nltk in /usr/local/lib/python3.6/dist-packages (3.2.5)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from nltk) (1.15.0)
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
In [110]:
from collections import OrderedDict
from nltk.tokenize import word_tokenize
def find_synonyms(word):
  synonyms = []
  for synset in wordnet.synsets(word):
    for syn in synset.lemma_names():
      synonyms.append(syn)

  # using this to drop duplicates while maintaining word order (closest synonyms comes first)
  synonyms_without_duplicates = list(OrderedDict.fromkeys(synonyms))
  return synonyms_without_duplicates
In [111]:
def create_set_of_new_sentences(sentence, max_syn_per_word = 3):
  count = 0
  new_sentences = []
  for word in word_tokenize(sentence):
    if len(word)<=3 : continue 
    for synonym in find_synonyms(word)[0:max_syn_per_word]:
      synonym = synonym.replace('_', ' ') #restore space character
      new_sentence = sentence.replace(word,synonym)
      if count <= 4:
        new_sentences.append(new_sentence)
        count += 1    
  return new_sentences
In [112]:
#Create a new dataframe with records not in GRP_0
new_dataframe = clean_data[clean_data["Assignment group"] != 'GRP_0']
zero_dataframe = clean_data[clean_data["Assignment group"] == 'GRP_0']
new_dataframe.head()
Out[112]:
Caller Assignment group New Description Clean_Description language Translated Text
6 jyoqwxhz clhxsoqy GRP_1 event: critical:HostName_221.company.com the v... event: critical:HostName_.company.com the valu... en event: critical:HostName_.company.com the valu...
17 sigfdwcj reofwzlm GRP_3 when undocking pc , screen will not come back ... when undocking pc , screen will not come back ... en When undocking pc , screen want distress come ...
32 kxsceyzo naokumlb GRP_4 \n\nreceived from: kxsceyzo.naokumlb@gmail.com... gentles, have two devices that are trying to s... en gentles, have two devices did are trying to sh...
43 yisohglr uvteflgb GRP_5 \n\nreceived from: yisohglr.uvteflgb@gmail.com... hi - the printer printer is not working and ne... en Hi - the printer printer is distress working a...
47 bpctwhsn kzqsbmtp GRP_6 received from: monitoring_tool@company.com\n\n... job Job_ failed in job_scheduler at: job Job_ ... en job Job_ failed in job_scheduler at: job Job_ ...
In [113]:
new_dataframe.shape, clean_data.shape
Out[113]:
((4525, 6), (8466, 6))
In [114]:
maxsyn=1
new_dataframe["Augmented_data"] = new_dataframe.apply(lambda x: create_set_of_new_sentences(x['Translated Text'], maxsyn),axis=1)
new_dataframe
Out[114]:
Caller Assignment group New Description Clean_Description language Translated Text Augmented_data
6 jyoqwxhz clhxsoqy GRP_1 event: critical:HostName_221.company.com the v... event: critical:HostName_.company.com the valu... en event: critical:HostName_.company.com the valu... [event: critical:HostName_.company.com the val...
17 sigfdwcj reofwzlm GRP_3 when undocking pc , screen will not come back ... when undocking pc , screen will not come back ... en When undocking pc , screen want distress come ... [When undock pc , screen want distress come ba...
32 kxsceyzo naokumlb GRP_4 \n\nreceived from: kxsceyzo.naokumlb@gmail.com... gentles, have two devices that are trying to s... en gentles, have two devices did are trying to sh... [pacify, have two devices did are trying to sh...
43 yisohglr uvteflgb GRP_5 \n\nreceived from: yisohglr.uvteflgb@gmail.com... hi - the printer printer is not working and ne... en Hi - the printer printer is distress working a... [Hi - the printer printer is distress working ...
47 bpctwhsn kzqsbmtp GRP_6 received from: monitoring_tool@company.com\n\n... job Job_ failed in job_scheduler at: job Job_ ... en job Job_ failed in job_scheduler at: job Job_ ... [job Job_ fail in job_scheduler at: job Job_ f...
... ... ... ... ... ... ... ...
44 wgmqlnzh vpebwoat GRP_30 早上开机后显示器不出图像。 显示器不亮 早上开机后显示器不出图像。 显示器不亮 zh-cn The display does not appear in the morning. Di... [The display does not appear in the morning. D...
45 rtjwbuev gfpwdetq GRP_31 prtSID_737--文件无法打印到打印机,提示打印机错误。 文件无法打印到打印机,提示打... prtSID_--文件无法打印到打印机,提示打印机错误。 文件无法打印到打印机,提示打印机错误。 zh-cn The prtsid _- file cannot be printed to the pr... [The prtsid _- file cannot be printed to the p...
46 fupikdoa gjkytoeh GRP_48 客户提供的在线送货单生成系统打不开,需尽快解决 客户提供的在线系统打不开 客户提供的在线送货单生成系统打不开,需尽快解决 客户提供的在线系统打不开 zh-cn The online delivery unit provided by the custo... [The on-line delivery unit provided by the cus...
47 kyagjxdh dmtjpbnz GRP_30 进行采购时显示"找不到员工1111154833的数据,请通知系统管理员" erp无法进行采购... 进行采购时显示"找不到员工的数据,请通知系统管理员" erp无法进行采购(转给贺正平) zh-cn Show "Data from the employee, please notify th... [show "Data from the employee, please notify t...
48 xqyjztnm onfusvlz GRP_30 to 小贺,早上电脑开机开不出来 电脑开机开不出来 to 小贺,早上电脑开机开不出来 电脑开机开不出来 zh-cn To small congratulations, the computer does no... [To small congratulations, the computer does n...

4525 rows × 7 columns

In [115]:
s = new_dataframe.apply(lambda x: pd.Series(x['Augmented_data']), axis=1).stack().reset_index(level=1, drop=True)
s.name = 'Final_Text'
new_dataframe_aug = new_dataframe.drop(['New Description','Augmented_data', 'Clean_Description', 'Translated Text'],axis=1).join(s)
new_dataframe_aug
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:1: DeprecationWarning:

The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.

Out[115]:
Caller Assignment group language Final_Text
0 vrfpyjwi nzhvgqiw GRP_24 de hello it's happened again The PC has been rele...
0 vrfpyjwi nzhvgqiw GRP_24 de hello it's happen again The PC has been releas...
0 vrfpyjwi nzhvgqiw GRP_24 de hello it's happened again The PC has been rele...
0 vrfpyjwi nzhvgqiw GRP_24 de hello it's happened again The PC has be releas...
0 vrfpyjwi nzhvgqiw GRP_24 de hello it's happened again The PC has been let ...
... ... ... ... ...
8498 ufawcgob aowhxjky GRP_62 en i at the unable to access the machine utilitie...
8498 ufawcgob aowhxjky GRP_62 en i at the unable to entree the machine utilitie...
8498 ufawcgob aowhxjky GRP_62 en i at the unable to access the machine utilitie...
8498 ufawcgob aowhxjky GRP_62 en i at the unable to access the machine utility ...
8498 ufawcgob aowhxjky GRP_62 en i at the unable to access the machine utilitie...

23537 rows × 4 columns

In [116]:
#dataframes=[clean_data_aug1,clean_data_aug2,clean_data_aug3,clean_aug4]
#dataframes=[clean_data_aug1,clean_data_aug2,clean_data_aug3]
zero_dataframe = zero_dataframe.rename(columns={"Translated Text": "Final_Text"})
zero_dataframe = zero_dataframe.drop(['New Description', 'Clean_Description'], axis = 1)
dataframes=[new_dataframe_aug, zero_dataframe]
clean_data_result= pd.concat(dataframes)
clean_data_result
Out[116]:
Caller Assignment group language Final_Text
0 vrfpyjwi nzhvgqiw GRP_24 de hello it's happened again The PC has been rele...
0 vrfpyjwi nzhvgqiw GRP_24 de hello it's happen again The PC has been releas...
0 vrfpyjwi nzhvgqiw GRP_24 de hello it's happened again The PC has been rele...
0 vrfpyjwi nzhvgqiw GRP_24 de hello it's happened again The PC has be releas...
0 vrfpyjwi nzhvgqiw GRP_24 de hello it's happened again The PC has been let ...
... ... ... ... ...
87 gasbfqvp fmvqgjih GRP_0 de On my part, the password was incorrectly enter...
97 nizholae bjnqikym GRP_0 de Stephryhan Needs Access to Below Collaboration...
100 bmhrsxlf ukatbwyi GRP_0 de benefits issue benefits issue
101 sjxhcyrq iupxtjcf GRP_0 de Security Error in travel expenses Billing Prog...
104 wfbkucds qaxhbois GRP_0 de I no longer know my ERP password and have fail...

27478 rows × 4 columns

In [117]:
# Assignment group distribution
print('\033[1mTotal assignment groups:\033[0m', clean_data_result['Assignment group'].nunique())

# Histogram
clean_data_result['Assignment group'].iplot(
    kind='hist',
    xTitle='Assignment Group',
    yTitle='count',
    title='Assignment Group Distribution- Histogram (Fig-5)')
Total assignment groups: 43
In [118]:
# Serialize the Augmented dataset for later use
clean_data_result.to_csv('Interim_data.csv', index=False, encoding='utf_8_sig')
with open('/content/Interim_data.pkl','wb') as f:
#with open('Interim_data.pkl','wb') as f:
    pickle.dump(clean_data_result, f, pickle.HIGHEST_PROTOCOL)

Stop words removal and Lemmatise text

In [119]:
clean_data_result.isnull().sum()
Out[119]:
Caller                0
Assignment group      0
language              0
Final_Text          197
dtype: int64
In [120]:
clean_data_result['Final_Text'] = clean_data_result['Final_Text'].fillna("")
In [121]:
import re
import string
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english')) 

processed_all_documents = list()

for desc in clean_data_result['Final_Text']:
    word_tokens = word_tokenize(desc) 
    
    filtered_sentence = [] 

    # Removing Stopwords
    for w in word_tokens: 
        if w not in stop_words: 
            filtered_sentence.append(w) 

    words = ' '.join(filtered_sentence)
    processed_all_documents.append(words)  
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
In [122]:
clean_data_result['Final_Text'] = processed_all_documents
In [123]:
clean_data_result.head(50)
Out[123]:
Caller Assignment group language Final_Text
0 vrfpyjwi nzhvgqiw GRP_24 de hello 's happened The PC released repeated tim...
0 vrfpyjwi nzhvgqiw GRP_24 de hello 's happen The PC released repeated times...
0 vrfpyjwi nzhvgqiw GRP_24 de hello 's happened The PC released repeated tim...
0 vrfpyjwi nzhvgqiw GRP_24 de hello 's happened The PC released repeated tim...
0 vrfpyjwi nzhvgqiw GRP_24 de hello 's happened The PC let go repeated times...
0 vrfpyjwi nzhvgqiw GRP_24 de hello Ben Tige Number Block Keyboard R Left Ha...
0 vrfpyjwi nzhvgqiw GRP_24 de Hello Ben Tige number Block Keyboard R Left Ha...
0 vrfpyjwi nzhvgqiw GRP_24 de Hello Ben Tige Number block Keyboard R Left Ha...
0 vrfpyjwi nzhvgqiw GRP_24 de Hello Ben Tige Number Block keyboard R Left Ha...
0 vrfpyjwi nzhvgqiw GRP_24 de Hello Ben Tige Number Block Keyboard R left Ha...
0 vrfpyjwi nzhvgqiw GRP_24 de IE browser opens CRM system , prompted user ca...
0 vrfpyjwi nzhvgqiw GRP_24 de After IE browser opens CRM system , prompted u...
0 vrfpyjwi nzhvgqiw GRP_24 de After IE browser open CRM system , prompted us...
0 vrfpyjwi nzhvgqiw GRP_24 de After IE browser opens CRM system , prompted u...
0 vrfpyjwi nzhvgqiw GRP_24 de After IE browser opens CRM system , motivate u...
0 wtgbdjzl coliybmq GRP_24 de hello 's happened The PC released repeated tim...
0 wtgbdjzl coliybmq GRP_24 de hello 's happen The PC released repeated times...
0 wtgbdjzl coliybmq GRP_24 de hello 's happened The PC released repeated tim...
0 wtgbdjzl coliybmq GRP_24 de hello 's happened The PC released repeated tim...
0 wtgbdjzl coliybmq GRP_24 de hello 's happened The PC let go repeated times...
0 wtgbdjzl coliybmq GRP_24 de hello Ben Tige Number Block Keyboard R Left Ha...
0 wtgbdjzl coliybmq GRP_24 de Hello Ben Tige number Block Keyboard R Left Ha...
0 wtgbdjzl coliybmq GRP_24 de Hello Ben Tige Number block Keyboard R Left Ha...
0 wtgbdjzl coliybmq GRP_24 de Hello Ben Tige Number Block keyboard R Left Ha...
0 wtgbdjzl coliybmq GRP_24 de Hello Ben Tige Number Block Keyboard R left Ha...
0 wtgbdjzl coliybmq GRP_24 de IE browser opens CRM system , prompted user ca...
0 wtgbdjzl coliybmq GRP_24 de After IE browser opens CRM system , prompted u...
0 wtgbdjzl coliybmq GRP_24 de After IE browser open CRM system , prompted us...
0 wtgbdjzl coliybmq GRP_24 de After IE browser opens CRM system , prompted u...
0 wtgbdjzl coliybmq GRP_24 de After IE browser opens CRM system , motivate u...
0 cjnlsbkq ocxnrewb GRP_31 zh-cn hello 's happened The PC released repeated tim...
0 cjnlsbkq ocxnrewb GRP_31 zh-cn hello 's happen The PC released repeated times...
0 cjnlsbkq ocxnrewb GRP_31 zh-cn hello 's happened The PC released repeated tim...
0 cjnlsbkq ocxnrewb GRP_31 zh-cn hello 's happened The PC released repeated tim...
0 cjnlsbkq ocxnrewb GRP_31 zh-cn hello 's happened The PC let go repeated times...
0 cjnlsbkq ocxnrewb GRP_31 zh-cn hello Ben Tige Number Block Keyboard R Left Ha...
0 cjnlsbkq ocxnrewb GRP_31 zh-cn Hello Ben Tige number Block Keyboard R Left Ha...
0 cjnlsbkq ocxnrewb GRP_31 zh-cn Hello Ben Tige Number block Keyboard R Left Ha...
0 cjnlsbkq ocxnrewb GRP_31 zh-cn Hello Ben Tige Number Block keyboard R Left Ha...
0 cjnlsbkq ocxnrewb GRP_31 zh-cn Hello Ben Tige Number Block Keyboard R left Ha...
0 cjnlsbkq ocxnrewb GRP_31 zh-cn IE browser opens CRM system , prompted user ca...
0 cjnlsbkq ocxnrewb GRP_31 zh-cn After IE browser opens CRM system , prompted u...
0 cjnlsbkq ocxnrewb GRP_31 zh-cn After IE browser open CRM system , prompted us...
0 cjnlsbkq ocxnrewb GRP_31 zh-cn After IE browser opens CRM system , prompted u...
0 cjnlsbkq ocxnrewb GRP_31 zh-cn After IE browser opens CRM system , motivate u...
1 lpnzjimy mwtvondq GRP_25 de current entered EU Tool Error Runtime Error EU...
1 lpnzjimy mwtvondq GRP_25 de Currents enter EU Tool Error Runtime Error EU ...
1 lpnzjimy mwtvondq GRP_25 de Currents entered EU tool Error Runtime Error E...
1 lpnzjimy mwtvondq GRP_25 de Currents entered EU Tool mistake Runtime mista...
1 lpnzjimy mwtvondq GRP_25 de Currents entered EU Tool mistake Runtime mista...
In [124]:
clean_data_result.dropna()
Out[124]:
Caller Assignment group language Final_Text
0 vrfpyjwi nzhvgqiw GRP_24 de hello 's happened The PC released repeated tim...
0 vrfpyjwi nzhvgqiw GRP_24 de hello 's happen The PC released repeated times...
0 vrfpyjwi nzhvgqiw GRP_24 de hello 's happened The PC released repeated tim...
0 vrfpyjwi nzhvgqiw GRP_24 de hello 's happened The PC released repeated tim...
0 vrfpyjwi nzhvgqiw GRP_24 de hello 's happened The PC let go repeated times...
... ... ... ... ...
87 gasbfqvp fmvqgjih GRP_0 de On part , password incorrectly entered please ...
97 nizholae bjnqikym GRP_0 de Stephryhan Needs Access Below Collaboration Pl...
100 bmhrsxlf ukatbwyi GRP_0 de benefits issue benefits issue
101 sjxhcyrq iupxtjcf GRP_0 de Security Error travel expenses Billing Program...
104 wfbkucds qaxhbois GRP_0 de I longer know ERP password failed attempts go ...

27478 rows × 4 columns

In [125]:
clean_data_result.isnull().sum()
Out[125]:
Caller              0
Assignment group    0
language            0
Final_Text          0
dtype: int64
In [126]:
clean_data_result['Final_Text'] = clean_data_result['Final_Text'].replace(np.nan, '', regex=True)
In [127]:
#Lemmatisation using spacy library
!pip install spacy
Requirement already satisfied: spacy in /usr/local/lib/python3.6/dist-packages (2.2.4)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.6/dist-packages (from spacy) (4.41.1)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.6/dist-packages (from spacy) (2.0.5)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from spacy) (3.0.5)
Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /usr/local/lib/python3.6/dist-packages (from spacy) (0.8.0)
Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.6/dist-packages (from spacy) (1.0.0)
Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.6/dist-packages (from spacy) (1.0.5)
Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.6/dist-packages (from spacy) (1.18.5)
Requirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.6/dist-packages (from spacy) (1.1.3)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.6/dist-packages (from spacy) (2.23.0)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.6/dist-packages (from spacy) (1.0.5)
Requirement already satisfied: blis<0.5.0,>=0.4.0 in /usr/local/lib/python3.6/dist-packages (from spacy) (0.4.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from spacy) (50.3.2)
Requirement already satisfied: thinc==7.4.0 in /usr/local/lib/python3.6/dist-packages (from spacy) (7.4.0)
Requirement already satisfied: importlib-metadata>=0.20; python_version < "3.8" in /usr/local/lib/python3.6/dist-packages (from catalogue<1.1.0,>=0.0.7->spacy) (3.1.1)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2020.12.5)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (1.24.3)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.6/dist-packages (from importlib-metadata>=0.20; python_version < "3.8"->catalogue<1.1.0,>=0.0.7->spacy) (3.4.0)
In [128]:
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz (12.0MB)
     |████████████████████████████████| 12.1MB 250kB/s 
Collecting spacy<2.4.0,>=2.3.0
  Downloading https://files.pythonhosted.org/packages/e5/bf/ca7bb25edd21f1cf9d498d0023808279672a664a70585e1962617ca2740c/spacy-2.3.5-cp36-cp36m-manylinux2014_x86_64.whl (10.4MB)
     |████████████████████████████████| 10.4MB 9.1MB/s 
Collecting thinc<7.5.0,>=7.4.1
  Downloading https://files.pythonhosted.org/packages/c0/1a/c3e4ab982214c63d743fad57c45c5e68ee49e4ea4384d27b28595a26ad26/thinc-7.4.5-cp36-cp36m-manylinux2014_x86_64.whl (1.1MB)
     |████████████████████████████████| 1.1MB 55.8MB/s 
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (50.3.2)
Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.6/dist-packages (from spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (1.18.5)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.6/dist-packages (from spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (4.41.1)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.6/dist-packages (from spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (2.23.0)
Requirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.6/dist-packages (from spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (1.1.3)
Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /usr/local/lib/python3.6/dist-packages (from spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (0.8.0)
Requirement already satisfied: blis<0.8.0,>=0.4.0 in /usr/local/lib/python3.6/dist-packages (from spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (0.4.1)
Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.6/dist-packages (from spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (1.0.5)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.6/dist-packages (from spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (2.0.5)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.6/dist-packages (from spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (1.0.5)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (3.0.5)
Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.6/dist-packages (from spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (1.0.0)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (2020.12.5)
Requirement already satisfied: importlib-metadata>=0.20; python_version < "3.8" in /usr/local/lib/python3.6/dist-packages (from catalogue<1.1.0,>=0.0.7->spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (3.1.1)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.6/dist-packages (from importlib-metadata>=0.20; python_version < "3.8"->catalogue<1.1.0,>=0.0.7->spacy<2.4.0,>=2.3.0->en-core-web-sm==2.3.1) (3.4.0)
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.3.1-cp36-none-any.whl size=12047110 sha256=c4fd302e7efc71b0dbe764f35ffe873c396e3a6ff9d8e11270a880c74edc4d72
  Stored in directory: /root/.cache/pip/wheels/2b/3f/41/f0b92863355c3ba34bb32b37d8a0c662959da0058202094f46
Successfully built en-core-web-sm
Installing collected packages: thinc, spacy, en-core-web-sm
  Found existing installation: thinc 7.4.0
    Uninstalling thinc-7.4.0:
      Successfully uninstalled thinc-7.4.0
  Found existing installation: spacy 2.2.4
    Uninstalling spacy-2.2.4:
      Successfully uninstalled spacy-2.2.4
  Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-2.3.1 spacy-2.3.5 thinc-7.4.5
In [129]:
!pip3 install spacy
Requirement already satisfied: spacy in /usr/local/lib/python3.6/dist-packages (2.3.5)
Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /usr/local/lib/python3.6/dist-packages (from spacy) (0.8.0)
Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.6/dist-packages (from spacy) (1.0.5)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.6/dist-packages (from spacy) (2.0.5)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from spacy) (3.0.5)
Requirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.6/dist-packages (from spacy) (1.1.3)
Requirement already satisfied: thinc<7.5.0,>=7.4.1 in /usr/local/lib/python3.6/dist-packages (from spacy) (7.4.5)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.6/dist-packages (from spacy) (2.23.0)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.6/dist-packages (from spacy) (4.41.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from spacy) (50.3.2)
Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.6/dist-packages (from spacy) (1.18.5)
Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.6/dist-packages (from spacy) (1.0.0)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.6/dist-packages (from spacy) (1.0.5)
Requirement already satisfied: blis<0.8.0,>=0.4.0 in /usr/local/lib/python3.6/dist-packages (from spacy) (0.4.1)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2020.12.5)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2.10)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (1.24.3)
Requirement already satisfied: importlib-metadata>=0.20; python_version < "3.8" in /usr/local/lib/python3.6/dist-packages (from catalogue<1.1.0,>=0.0.7->spacy) (3.1.1)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.6/dist-packages (from importlib-metadata>=0.20; python_version < "3.8"->catalogue<1.1.0,>=0.0.7->spacy) (3.4.0)
In [130]:
# Need to run "python -m spacy download en" in anaconda prompt to avoid 'en' not found issue.
In [131]:
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']
def lemmatize_text(text):
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc])

clean_data_result['Final_Text'] = clean_data_result['Final_Text'].apply(lemmatize_text)
In [132]:
clean_data_result
Out[132]:
Caller Assignment group language Final_Text
0 vrfpyjwi nzhvgqiw GRP_24 de hello be happen the pc release repeat time blu...
0 vrfpyjwi nzhvgqiw GRP_24 de hello be happen the pc release repeat time blu...
0 vrfpyjwi nzhvgqiw GRP_24 de hello be happen the pc release repeat time blu...
0 vrfpyjwi nzhvgqiw GRP_24 de hello be happen the pc release repeat time blu...
0 vrfpyjwi nzhvgqiw GRP_24 de hello be happen the pc let go repeat time blue...
... ... ... ... ...
87 gasbfqvp fmvqgjih GRP_0 de on part , password incorrectly enter please pa...
97 nizholae bjnqikym GRP_0 de Stephryhan need Access below Collaboration Pla...
100 bmhrsxlf ukatbwyi GRP_0 de benefit issue benefit issue
101 sjxhcyrq iupxtjcf GRP_0 de Security Error travel expense Billing ProgramD...
104 wfbkucds qaxhbois GRP_0 de -PRON- long know ERP password fail attempt go ...

27478 rows × 4 columns

In [133]:
# Serialize the translated dataset
clean_data_result.to_csv('Final_data.csv', index=False, encoding='utf_8_sig')
with open('/content/Final_data.pkl','wb') as f:
    pickle.dump(clean_data_result, f, pickle.HIGHEST_PROTOCOL)
In [134]:
# Load the translated pickle file 
with open('/content/Final_data.pkl','rb') as f:
    clean_data = pickle.load(f)

Univariate visualization

Single-variable or univariate visualization is the simplest type of visualization which consists of observations on only a single characteristic or attribute. Univariate visualization includes histogram, bar plots and line charts.

The distribution of Assignment groups

Plots how the assignments groups are scattered across the dataset. The bar chart, histogram and pie chart tells the frequency of any ticket assigned to any group OR the tickets count for each group.

In [135]:
# Assignment group distribution
print('\033[1mTotal assignment groups:\033[0m', clean_data['Assignment group'].nunique())

# Histogram
clean_data['Assignment group'].iplot(
    kind='hist',
    xTitle='Assignment Group',
    yTitle='count',
    title='Assignment Group Distribution- Histogram (Fig-1)')

# Pie chart
assgn_grp = pd.DataFrame(clean_data.groupby('Assignment group').size(),columns = ['Count']).reset_index()
assgn_grp.iplot(
    kind='pie', 
    labels='Assignment group', 
    values='Count', 
    title='Assignment Group Distribution- Pie Chart (Fig-2)', 
    hoverinfo="label+percent+name", hole=0.25)
Total assignment groups: 43

Lets visualize the percentage of incidents per assignment group

In [136]:
# Plot to visualize the percentage data distribution across different groups
sns.set(style="whitegrid")
plt.figure(figsize=(20,5))
ax = sns.countplot(x="Assignment group", data=clean_data, order=clean_data["Assignment group"].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
for p in ax.patches:
  ax.annotate(str(format(p.get_height()/len(clean_data.index)*100, '.2f')+"%"), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'bottom', rotation=90, xytext = (0, 10), textcoords = 'offset points')
In [137]:
top_20 = clean_data['Assignment group'].value_counts().nlargest(20).reset_index()
In [138]:
plt.figure(figsize=(12,6))
bars = plt.bar(top_20['index'],top_20['Assignment group'])
plt.title('Top 20 Assignment groups with highest number of Tickets')
plt.xlabel('Assignment Group')
plt.xticks(rotation=90)
plt.ylabel('Number of Tickets')

for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x(), yval + .005, yval)
plt.tight_layout()
plt.show()
In [139]:
bottom_20 = clean_data['Assignment group'].value_counts().nsmallest(20).reset_index()
In [140]:
plt.figure(figsize=(12,6))
bars = plt.bar(bottom_20['index'],bottom_20['Assignment group'])
plt.title('Bottom 20 Assignment groups with small number of Tickets')
plt.xlabel('Assignment Group')
plt.xticks(rotation=90)
plt.ylabel('Number of Tickets')
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x(), yval + .005, yval)
plt.tight_layout()
plt.show()

The distribution of Callers

Plots how the callers are associated with tickets and what are the assignment groups they most frequently raise tickets for.

In [141]:
# Find out top 10 callers in terms of frequency of raising tickets in the entire dataset
print('\033[1mTotal caller count:\033[0m', clean_data['Caller'].nunique())
df = pd.DataFrame(clean_data.groupby(['Caller']).size().nlargest(10), columns=['Count']).reset_index()
df.iplot(kind='pie',
         labels='Caller', 
         values='Count', 
         title='Top 10 caller- Pie Chart (Fig-7)',
         colorscale='-spectral',
         pull=[0,0,0,0,0.05,0.1,0.15,0.2,0.25,0.3])
Total caller count: 2836
In [142]:
# Top 5 callers in each assignment group
top_n = 5
s = clean_data['Caller'].groupby(clean_data['Assignment group']).value_counts()
caller_grp = pd.DataFrame(s.groupby(level=0).nlargest(top_n).reset_index(level=0, drop=True))
caller_grp.head(15)
Out[142]:
Caller
Assignment group Caller
GRP_0 fumkcsji sarmtlhy 132
rbozivdq gmlhrtvp 87
olckhmvx pcqobjnd 54
efbwiadp dicafxhv 45
mfeyouli ndobtzpw 13
GRP_1 jyoqwxhz clhxsoqy 35
jloygrwh acvztedi 20
spxqmiry zpwgoqju 15
bpctwhsn kzqsbmtp 12
kbnfxpsy gehxzayq 10
GRP_10 bpctwhsn kzqsbmtp 161
ihfkwzjd erbxoyqk 30
dizquolf hlykecxa 24
xfznctqa xstndbwa 24
byclpwmv esafrtbh 23

The distribution of description lengths

Plots the variation of length and word count of new description attribute

In [143]:
clean_data.insert(1, 'desc_len', clean_data['Final_Text'].astype(str).apply(len))
clean_data.insert(5, 'desc_word_count', clean_data['Final_Text'].apply(lambda x: len(str(x).split())))
clean_data.head()
Out[143]:
Caller desc_len Assignment group language Final_Text desc_word_count
0 vrfpyjwi nzhvgqiw 106 GRP_24 de hello be happen the pc release repeat time blu... 18
0 vrfpyjwi nzhvgqiw 106 GRP_24 de hello be happen the pc release repeat time blu... 18
0 vrfpyjwi nzhvgqiw 106 GRP_24 de hello be happen the pc release repeat time blu... 18
0 vrfpyjwi nzhvgqiw 106 GRP_24 de hello be happen the pc release repeat time blu... 18
0 vrfpyjwi nzhvgqiw 105 GRP_24 de hello be happen the pc let go repeat time blue... 19
In [144]:
# Description text length
clean_data['desc_len'].iplot(
    kind='bar',
    xTitle='text length',
    yTitle='count',
    colorscale='-ylgn',
    title='Description Text Length Distribution (Fig-11)')

# Description word count
clean_data['desc_word_count'].iplot(
    kind='bar',
    xTitle='word count',
    linecolor='black',
    yTitle='count',
    colorscale='-bupu',
    title='Description Word Count Distribution (Fig-12)')

N-Grams

N-gram is a contiguous sequence of N items from a given sample of text or speech, in the fields of computational linguistics and probability. The items can be phonemes, syllables, letters, words or base pairs according to the application. N-grams are used to describe the number of words used as observation points, e.g., unigram means singly-worded, bigram means 2-worded phrase, and trigram means 3-worded phrase.

We'll be using scikit-learn’s CountVectorizer function to derive n-grams and compare them before and after removing stop words. Stop words are a set of commonly used words in any language. We'll be using english corpus stopwords and extend it to include some business specific common words considered to be stop words in our case.

In [145]:
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from sklearn.feature_extraction.text import CountVectorizer

# Extend the English Stop Wordss
STOP_WORDS = STOPWORDS.union({'yes','na','hi',
                              'receive','hello',
                              'regards','thanks',
                              'from','greeting',
                              'forward','reply',
                              'will','please',
                              'see','help','able'})

# Generic function to derive top N n-grams from the corpus
def get_top_n_ngrams(corpus, top_n=None, ngram_range=(1,1), stopwords=None):
    vec = CountVectorizer(ngram_range=ngram_range, 
                          stop_words=stopwords).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:top_n]

Top Unigrams

In [146]:
# Top 50 Unigrams before removing stop words
top_n = 50
ngram_range = (1,1)
uni_grams = get_top_n_ngrams(clean_data.Final_Text, top_n, ngram_range)

df = pd.DataFrame(uni_grams, columns = ['Final_Text' , 'count'])
df.groupby('Final_Text').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', 
    yTitle='Count', 
    linecolor='black', 
    colorscale='piyg',
    title=f'Top {top_n} Unigrams in Final_Text')

# Top 50 Unigrams after removing stop words
uni_grams_sw = get_top_n_ngrams(clean_data.Final_Text, top_n, ngram_range, stopwords=STOP_WORDS)

df = pd.DataFrame(uni_grams_sw, columns = ['Final_Text' , 'count'])
df.groupby('Final_Text').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', 
    yTitle='Count', 
    linecolor='black',
    colorscale='-piyg',
    title=f'Top {top_n} Unigrams in Final_Text without stop words')
/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py:385: UserWarning:

Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['aren', 'couldn', 'didn', 'doesn', 'don', 'hadn', 'hasn', 'haven', 'isn', 'let', 'll', 'mustn', 're', 'shan', 'shouldn', 've', 'wasn', 'weren', 'won', 'wouldn'] not in stop_words.

Top Bigrams

In [147]:
# Top 50 Bigrams before removing stop words
top_n = 50
ngram_range = (2,2)
bi_grams = get_top_n_ngrams(clean_data.Final_Text, top_n, ngram_range)

df = pd.DataFrame(bi_grams, columns = ['Final_Text' , 'count'])
df.groupby('Final_Text').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', 
    yTitle='Count', 
    linecolor='black', 
    colorscale='piyg',
    title=f'Top {top_n} Bigrams in Final_Text')

# Top 50 Bigrams after removing stop words
bi_grams_sw = get_top_n_ngrams(clean_data.Final_Text, top_n, ngram_range, stopwords=STOP_WORDS)

df = pd.DataFrame(bi_grams_sw, columns = ['Final_Text' , 'count'])
df.groupby('Final_Text').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', 
    yTitle='Count', 
    linecolor='black',
    colorscale='-piyg',
    title=f'Top {top_n} Bigrams in Final_Text without stop words')
/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py:385: UserWarning:

Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['aren', 'couldn', 'didn', 'doesn', 'don', 'hadn', 'hasn', 'haven', 'isn', 'let', 'll', 'mustn', 're', 'shan', 'shouldn', 've', 'wasn', 'weren', 'won', 'wouldn'] not in stop_words.

Top Trigrams

In [148]:
# Top 50 Trigrams before removing stop words
top_n = 50
ngram_range = (3,3)
tri_grams = get_top_n_ngrams(clean_data.Final_Text, top_n, ngram_range)

df = pd.DataFrame(tri_grams, columns = ['Final_Text' , 'count'])
df.groupby('Final_Text').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', 
    yTitle='Count', 
    linecolor='black', 
    colorscale='piyg',
    title=f'Top {top_n} Trigrams in Final_Text')

# Top 50 Trigrams after removing stop words
tri_grams_sw = get_top_n_ngrams(clean_data.Final_Text, top_n, ngram_range, stopwords=STOP_WORDS)

df = pd.DataFrame(tri_grams_sw, columns = ['Final_Text' , 'count'])
df.groupby('Final_Text').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', 
    yTitle='Count', 
    linecolor='black',
    colorscale='-piyg',
    title=f'Top {top_n} Trigrams in Final_Text without stop words')
/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py:385: UserWarning:

Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['aren', 'couldn', 'didn', 'doesn', 'don', 'hadn', 'hasn', 'haven', 'isn', 'let', 'll', 'mustn', 're', 'shan', 'shouldn', 've', 'wasn', 'weren', 'won', 'wouldn'] not in stop_words.

Word Cloud

Let us attempt to visualize this as a word cloud for top three groups that has got maximum records. A word cloud enables us to visualize the data as cluster of words and each words displayed in different font size based on the number of occurences of that word . Basically; the bolder and bigger the word show up in the visualization, it implies its more often it’s mentioned within a given text compared to other words in the cloud and therefore would be more important for us.

Let's write a generic method to generate Word Clouds for both Short and Long Description columns.

In [149]:
# replace any single word character with a word boundary
#clean_data.Final_Text.str.replace(r'\b\w\b','').str.replace(r'\s+', ' ')
In [150]:
def generate_word_cloud(corpus):
        # Instantiate the wordcloud object
    wordcloud = WordCloud(width = 800, height = 800, 
                    background_color ='white', 
                    stopwords=STOP_WORDS,
                    # mask=mask,
                    min_font_size = 10).generate(corpus)

    # plot the WordCloud image                        
    plt.figure(figsize = (12, 12), facecolor = None) 
    plt.imshow(wordcloud) 
    plt.axis("off") 
    plt.tight_layout(pad = 0) 

    plt.show()
In [151]:
# Word Cloud for all tickets assigned to GRP_0
generate_word_cloud(' '.join(clean_data[clean_data['Assignment group'] == 'GRP_0'].Final_Text.str.strip()))
In [152]:
# Word Cloud for all tickets assigned to GRP_8
generate_word_cloud(' '.join(clean_data[clean_data['Assignment group'] == 'GRP_8'].Final_Text.str.strip()))
In [153]:
# Word Cloud for all tickets assigned to GRP_25
generate_word_cloud(' '.join(clean_data[clean_data['Assignment group'] == 'GRP_25'].Final_Text.str.strip()))
In [154]:
# Generate wordcloud for Final_Text field
generate_word_cloud(' '.join(clean_data.Final_Text.str.strip()))

Prepping Dataframe for Model Building

In [155]:
'''# Create a target categorical column
clean_data['Assignment group OneHotEncoded'] = clean_data['Assignment group'].astype('category').cat.codes
clean_data.info()'''
Out[155]:
"# Create a target categorical column\nclean_data['Assignment group OneHotEncoded'] = clean_data['Assignment group'].astype('category').cat.codes\nclean_data.info()"
In [156]:
'''# Import OneHot encoder 
from sklearn.preprocessing import LabelBinarizer
from sklearn import preprocessing 
clean_data['Assignment group OneHotEncoded'] = np.nan
# OneHot_encoder object knows how to understand word labels. 
#onehot_encoder = preprocessing.OneHotEncoder() #categories=62
onehot_encoder = LabelBinarizer()
onehot_encoder.fit(clean_data['Assignment group'])
# Encode labels in column
#transformed = onehot_encoder.fit_transform(clean_data['Assignment group'])
#temp_df = pd.DataFrame(transformed, columns=onehot_encoder.get_feature_names())
transformed = onehot_encoder.transform(clean_data['Assignment group'])
temp_df = pd.DataFrame(transformed)
clean_data = pd.concat([clean_data, temp_df], axis=1)
#clean_data
#clean_data['Assignment group OneHotEncoded'].unique()
clean_data'''
Out[156]:
"# Import OneHot encoder \nfrom sklearn.preprocessing import LabelBinarizer\nfrom sklearn import preprocessing \nclean_data['Assignment group OneHotEncoded'] = np.nan\n# OneHot_encoder object knows how to understand word labels. \n#onehot_encoder = preprocessing.OneHotEncoder() #categories=62\nonehot_encoder = LabelBinarizer()\nonehot_encoder.fit(clean_data['Assignment group'])\n# Encode labels in column\n#transformed = onehot_encoder.fit_transform(clean_data['Assignment group'])\n#temp_df = pd.DataFrame(transformed, columns=onehot_encoder.get_feature_names())\ntransformed = onehot_encoder.transform(clean_data['Assignment group'])\ntemp_df = pd.DataFrame(transformed)\nclean_data = pd.concat([clean_data, temp_df], axis=1)\n#clean_data\n#clean_data['Assignment group OneHotEncoded'].unique()\nclean_data"
In [157]:
clean_data
Out[157]:
Caller desc_len Assignment group language Final_Text desc_word_count
0 vrfpyjwi nzhvgqiw 106 GRP_24 de hello be happen the pc release repeat time blu... 18
0 vrfpyjwi nzhvgqiw 106 GRP_24 de hello be happen the pc release repeat time blu... 18
0 vrfpyjwi nzhvgqiw 106 GRP_24 de hello be happen the pc release repeat time blu... 18
0 vrfpyjwi nzhvgqiw 106 GRP_24 de hello be happen the pc release repeat time blu... 18
0 vrfpyjwi nzhvgqiw 105 GRP_24 de hello be happen the pc let go repeat time blue... 19
... ... ... ... ... ... ...
87 gasbfqvp fmvqgjih 103 GRP_0 de on part , password incorrectly enter please pa... 15
97 nizholae bjnqikym 2028 GRP_0 de Stephryhan need Access below Collaboration Pla... 267
100 bmhrsxlf ukatbwyi 27 GRP_0 de benefit issue benefit issue 4
101 sjxhcyrq iupxtjcf 99 GRP_0 de Security Error travel expense Billing ProgramD... 12
104 wfbkucds qaxhbois 80 GRP_0 de -PRON- long know ERP password fail attempt go ... 14

27478 rows × 6 columns

In [158]:
# Import label encoder 
from sklearn import preprocessing 
  
# label_encoder object knows how to understand word labels. 
label_encoder = preprocessing.LabelEncoder() 
  
# Encode labels in column 'species'. 
clean_data['Assignment group LabelEncoded']= label_encoder.fit_transform(clean_data['Assignment group']) 
  
clean_data['Assignment group LabelEncoded'].unique()
Out[158]:
array([17, 25, 18,  4, 35, 26, 24, 32, 21,  1,  8, 12, 27, 13,  6, 23,  2,
       22, 29,  5, 42, 36, 19, 34, 37, 40, 41, 10,  3,  7,  9, 11, 14, 15,
       16, 20, 28, 30, 31, 33, 38, 39,  0])
In [159]:
label_encoded_dict = dict(zip(clean_data['Assignment group'].unique(), clean_data['Assignment group LabelEncoded'].unique()))
len(label_encoded_dict)
Out[159]:
43

Feature Extraction : Bag of Words using CountVectorizer

In [160]:
from sklearn.feature_extraction.text import CountVectorizer

CV = CountVectorizer(max_features=2000)

X_BoW = CV.fit_transform(clean_data['Final_Text']).toarray()
y = clean_data['Assignment group LabelEncoded']

print("Shape of Input Feature :",np.shape(X_BoW))
print("Shape of Target Feature :",np.shape(y))
Shape of Input Feature : (27478, 2000)
Shape of Target Feature : (27478,)
In [161]:
# Splitting Train Test 
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_BoW, y, test_size=0.3, random_state = 0, stratify=y)
print('\033[1mShape of the training set:\033[0m', X_train.shape, X_test.shape)
print('\033[1mShape of the test set:\033[0m', y_train.shape, y_test.shape)
Shape of the training set: (19234, 2000) (8244, 2000)
Shape of the test set: (19234,) (8244,)
In [162]:
def run_classification(estimator, X_train, X_test, y_train, y_test, arch_name=None, pipelineRequired=True, isDeepModel=False):
    # train the model
    clf = estimator

    if pipelineRequired :
        clf = Pipeline([('tfidf', TfidfTransformer()),
                     ('clf', estimator),
                     ])
      
    if isDeepModel :
        clf.fit(X_train, y_train, validation_data=(X_test, y_test),epochs=25, batch_size=128,verbose=1,callbacks=call_backs(arch_name))
        # predict from the clasiffier
        y_pred = clf.predict(X_test)
        y_pred = np.argmax(y_pred, axis=1)
        y_train_pred = clf.predict(X_train)
        y_train_pred = np.argmax(y_train_pred, axis=1)
    else :
        clf.fit(X_train, y_train)
        # predict from the clasiffier
        y_pred = clf.predict(X_test)
        y_train_pred = clf.predict(X_train)
    
    print('Estimator:', clf)
    print('='*80)
    print('Training accuracy: %.2f%%' % (accuracy_score(y_train,y_train_pred) * 100))
    print('Testing accuracy: %.2f%%' % (accuracy_score(y_test, y_pred) * 100))
    print('='*80)
    print('Confusion matrix:\n %s' % (confusion_matrix(y_test, y_pred)))
    print('='*80)
    print('Classification report:\n %s' % (classification_report(y_test, y_pred)))
    

Logistic Regression

In [163]:
run_classification(LogisticRegression(), X_train, X_test, y_train, y_test)
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Estimator: Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)
================================================================================
Training accuracy: 69.90%
Testing accuracy: 63.56%
================================================================================
Confusion matrix:
 [[987   0   2 ...   1   2   5]
 [  3  14   0 ...   0   3   0]
 [  6   0 136 ...   0  14   0]
 ...
 [ 11   0   0 ...  82   0   0]
 [ 18   1   4 ...   0 528   0]
 [ 23   0   0 ...   0  89  78]]
================================================================================
Classification report:
               precision    recall  f1-score   support

           0       0.64      0.84      0.72      1182
           1       0.58      0.30      0.40        46
           2       0.85      0.74      0.79       184
           3       0.86      0.50      0.63        48
           4       0.60      0.69      0.64       503
           5       0.81      0.78      0.80       225
           6       0.80      0.72      0.76       184
           7       0.89      0.71      0.79        58
           8       0.82      0.78      0.80       134
           9       0.83      0.81      0.82        37
          10       0.85      0.84      0.85       135
          11       0.76      0.71      0.73       310
          12       0.68      0.65      0.66       313
          13       1.00      0.47      0.64        58
          14       1.00      0.60      0.75        43
          15       0.92      0.72      0.80        46
          16       0.76      0.84      0.79        37
          17       0.45      0.72      0.55       814
          18       0.67      0.61      0.64       245
          19       0.76      0.59      0.67        91
          20       1.00      0.07      0.14        27
          21       0.71      0.39      0.51       107
          22       0.87      0.72      0.79       151
          23       0.69      0.70      0.70       298
          24       0.18      0.02      0.03       131
          25       0.76      0.17      0.28       129
          26       0.37      0.45      0.41       483
          27       0.56      0.42      0.48       138
          28       1.00      0.62      0.77        24
          29       0.88      0.66      0.75       151
          30       0.86      0.76      0.81        67
          31       0.95      0.95      0.95        60
          32       0.43      0.26      0.32       214
          33       0.90      0.57      0.69        46
          34       0.93      0.38      0.54        34
          35       0.09      0.01      0.02       105
          36       0.60      0.25      0.36       127
          37       0.74      0.47      0.58       196
          38       1.00      0.33      0.50        18
          39       0.88      0.24      0.38        29
          40       0.89      0.79      0.84       104
          41       0.63      0.80      0.70       662
          42       0.81      0.31      0.45       250

    accuracy                           0.64      8244
   macro avg       0.75      0.56      0.61      8244
weighted avg       0.65      0.64      0.62      8244

Naive Bayes Classifier

In [164]:
run_classification(MultinomialNB(), X_train, X_test, y_train, y_test)
Estimator: Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)
================================================================================
Training accuracy: 54.41%
Testing accuracy: 50.63%
================================================================================
Confusion matrix:
 [[1106    0    1 ...    0    2    1]
 [   4    0    0 ...    0    7    0]
 [  33    0   82 ...    0   31    0]
 ...
 [  76    0    0 ...   18    0    0]
 [  25    0    0 ...    0  537    4]
 [  56    0    0 ...    0   89   62]]
================================================================================
Classification report:
               precision    recall  f1-score   support

           0       0.40      0.94      0.56      1182
           1       0.00      0.00      0.00        46
           2       0.84      0.45      0.58       184
           3       0.67      0.08      0.15        48
           4       0.50      0.58      0.54       503
           5       0.62      0.78      0.69       225
           6       0.78      0.45      0.57       184
           7       1.00      0.14      0.24        58
           8       0.89      0.24      0.38       134
           9       0.00      0.00      0.00        37
          10       0.85      0.63      0.72       135
          11       0.79      0.57      0.66       310
          12       0.59      0.39      0.47       313
          13       1.00      0.10      0.19        58
          14       1.00      0.07      0.13        43
          15       1.00      0.02      0.04        46
          16       1.00      0.11      0.20        37
          17       0.44      0.68      0.53       814
          18       0.60      0.44      0.51       245
          19       0.96      0.26      0.41        91
          20       0.00      0.00      0.00        27
          21       0.39      0.10      0.16       107
          22       0.85      0.64      0.73       151
          23       0.84      0.50      0.63       298
          24       0.00      0.00      0.00       131
          25       1.00      0.06      0.12       129
          26       0.33      0.41      0.36       483
          27       0.45      0.20      0.27       138
          28       0.00      0.00      0.00        24
          29       0.96      0.36      0.52       151
          30       1.00      0.13      0.24        67
          31       1.00      0.52      0.68        60
          32       0.38      0.09      0.15       214
          33       0.80      0.09      0.16        46
          34       1.00      0.03      0.06        34
          35       0.14      0.01      0.02       105
          36       0.56      0.08      0.14       127
          37       0.69      0.38      0.49       196
          38       0.00      0.00      0.00        18
          39       0.00      0.00      0.00        29
          40       1.00      0.17      0.30       104
          41       0.58      0.81      0.68       662
          42       0.49      0.25      0.33       250

    accuracy                           0.51      8244
   macro avg       0.61      0.27      0.32      8244
weighted avg       0.57      0.51      0.47      8244

/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

K-nearest Neighbor

In [165]:
run_classification(KNeighborsClassifier(), X_train, X_test, y_train, y_test)
Estimator: Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=None, n_neighbors=5, p=2,
                                      weights='uniform'))],
         verbose=False)
================================================================================
Training accuracy: 76.02%
Testing accuracy: 65.39%
================================================================================
Confusion matrix:
 [[784   2   5 ...   3  73  31]
 [  0  30   0 ...   0   0   0]
 [  1   2 134 ...   0  13   5]
 ...
 [  9   0   0 ...  77   0   0]
 [  0   0   4 ...   3 516   4]
 [  3   0   0 ...   0  90  86]]
================================================================================
Classification report:
               precision    recall  f1-score   support

           0       0.84      0.66      0.74      1182
           1       0.67      0.65      0.66        46
           2       0.75      0.73      0.74       184
           3       0.81      0.81      0.81        48
           4       0.60      0.67      0.63       503
           5       0.88      0.83      0.85       225
           6       0.84      0.83      0.83       184
           7       0.94      0.78      0.85        58
           8       0.85      0.86      0.86       134
           9       0.65      1.00      0.79        37
          10       0.82      0.87      0.84       135
          11       0.84      0.84      0.84       310
          12       0.79      0.84      0.82       313
          13       0.88      0.88      0.88        58
          14       0.95      0.98      0.97        43
          15       0.91      0.67      0.78        46
          16       0.74      0.84      0.78        37
          17       0.49      0.62      0.55       814
          18       0.59      0.65      0.62       245
          19       0.85      0.88      0.86        91
          20       0.69      0.74      0.71        27
          21       0.56      0.55      0.56       107
          22       0.92      0.79      0.85       151
          23       0.82      0.84      0.83       298
          24       0.12      0.08      0.10       131
          25       0.38      0.34      0.36       129
          26       0.39      0.38      0.38       483
          27       0.50      0.55      0.53       138
          28       0.95      0.79      0.86        24
          29       0.91      0.77      0.84       151
          30       0.94      0.91      0.92        67
          31       0.97      0.95      0.96        60
          32       0.42      0.25      0.31       214
          33       0.95      0.80      0.87        46
          34       0.70      0.56      0.62        34
          35       0.10      0.05      0.07       105
          36       0.30      0.43      0.36       127
          37       0.51      0.51      0.51       196
          38       1.00      0.50      0.67        18
          39       0.73      0.66      0.69        29
          40       0.92      0.74      0.82       104
          41       0.56      0.78      0.65       662
          42       0.53      0.34      0.42       250

    accuracy                           0.65      8244
   macro avg       0.71      0.68      0.69      8244
weighted avg       0.66      0.65      0.65      8244

Support Vector Machine (SVM)

In [166]:
run_classification(LinearSVC(), X_train, X_test, y_train, y_test)
Estimator: Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='squared_hinge', max_iter=1000,
                           multi_class='ovr', penalty='l2', random_state=None,
                           tol=0.0001, verbose=0))],
         verbose=False)
================================================================================
Training accuracy: 77.83%
Testing accuracy: 70.74%
================================================================================
Confusion matrix:
 [[929   0   3 ...   2   5   5]
 [  0  30   0 ...   0   3   0]
 [  0   0 148 ...   0  13   0]
 ...
 [  4   0   0 ...  92   0   0]
 [ 17   1   4 ...   0 543   2]
 [ 18   0   0 ...   0  89  85]]
================================================================================
Classification report:
               precision    recall  f1-score   support

           0       0.79      0.79      0.79      1182
           1       0.67      0.65      0.66        46
           2       0.81      0.80      0.81       184
           3       0.98      0.90      0.93        48
           4       0.70      0.69      0.70       503
           5       0.94      0.89      0.92       225
           6       0.88      0.84      0.86       184
           7       0.97      1.00      0.98        58
           8       0.84      0.90      0.87       134
           9       0.86      1.00      0.92        37
          10       0.98      0.94      0.96       135
          11       0.82      0.86      0.84       310
          12       0.80      0.79      0.79       313
          13       0.98      0.91      0.95        58
          14       0.95      0.93      0.94        43
          15       0.95      0.91      0.93        46
          16       0.90      1.00      0.95        37
          17       0.52      0.70      0.60       814
          18       0.69      0.71      0.70       245
          19       0.80      0.87      0.83        91
          20       0.93      0.52      0.67        27
          21       0.61      0.61      0.61       107
          22       0.91      0.89      0.90       151
          23       0.80      0.84      0.82       298
          24       0.18      0.07      0.10       131
          25       0.57      0.30      0.40       129
          26       0.45      0.47      0.46       483
          27       0.55      0.60      0.58       138
          28       0.85      0.96      0.90        24
          29       0.90      0.80      0.85       151
          30       0.94      0.96      0.95        67
          31       0.95      1.00      0.98        60
          32       0.40      0.32      0.36       214
          33       0.93      0.83      0.87        46
          34       1.00      0.62      0.76        34
          35       0.11      0.08      0.09       105
          36       0.54      0.34      0.42       127
          37       0.77      0.55      0.64       196
          38       0.75      0.50      0.60        18
          39       0.93      0.93      0.93        29
          40       0.96      0.88      0.92       104
          41       0.63      0.82      0.72       662
          42       0.69      0.34      0.46       250

    accuracy                           0.71      8244
   macro avg       0.77      0.73      0.74      8244
weighted avg       0.71      0.71      0.70      8244

Decision Tree

In [167]:
run_classification(DecisionTreeClassifier(), X_train, X_test, y_train, y_test)
Estimator: Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features=None, max_leaf_nodes=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        presort='deprecated', random_state=None,
                                        splitter='best'))],
         verbose=False)
================================================================================
Training accuracy: 82.18%
Testing accuracy: 71.12%
================================================================================
Confusion matrix:
 [[894   0   5 ...   6  12   3]
 [  0  30   0 ...   0   3   0]
 [  6   0 143 ...   0  14   0]
 ...
 [  1   0   0 ...  89   0   0]
 [  0   0   4 ...   0 565   2]
 [  1   0   0 ...   0  97 100]]
================================================================================
Classification report:
               precision    recall  f1-score   support

           0       0.89      0.76      0.82      1182
           1       0.62      0.65      0.64        46
           2       0.67      0.78      0.72       184
           3       0.79      0.85      0.82        48
           4       0.61      0.74      0.67       503
           5       0.91      0.90      0.91       225
           6       0.86      0.90      0.88       184
           7       0.92      0.95      0.93        58
           8       0.81      0.94      0.87       134
           9       0.79      1.00      0.88        37
          10       0.97      0.91      0.94       135
          11       0.85      0.92      0.88       310
          12       0.81      0.83      0.82       313
          13       0.93      0.90      0.91        58
          14       0.95      0.98      0.97        43
          15       0.81      0.96      0.88        46
          16       0.89      0.89      0.89        37
          17       0.50      0.66      0.57       814
          18       0.66      0.67      0.66       245
          19       0.83      0.76      0.79        91
          20       0.88      0.85      0.87        27
          21       0.55      0.59      0.57       107
          22       0.90      0.87      0.89       151
          23       0.86      0.87      0.87       298
          24       0.19      0.08      0.12       131
          25       0.71      0.38      0.49       129
          26       0.45      0.43      0.44       483
          27       0.54      0.58      0.56       138
          28       0.96      0.92      0.94        24
          29       0.90      0.89      0.90       151
          30       0.89      0.94      0.91        67
          31       0.98      1.00      0.99        60
          32       0.49      0.31      0.38       214
          33       0.81      0.85      0.83        46
          34       0.82      0.68      0.74        34
          35       0.11      0.05      0.07       105
          36       0.67      0.39      0.50       127
          37       0.80      0.55      0.65       196
          38       0.75      0.50      0.60        18
          39       0.86      0.86      0.86        29
          40       0.94      0.86      0.89       104
          41       0.64      0.85      0.73       662
          42       0.79      0.40      0.53       250

    accuracy                           0.71      8244
   macro avg       0.76      0.74      0.74      8244
weighted avg       0.72      0.71      0.70      8244

Random Forest

In [168]:
run_classification(RandomForestClassifier(n_estimators=100, random_state=0), X_train, X_test, y_train, y_test)
Estimator: Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=100, n_jobs=None,
                                        oob_score=False, random_state=0,
                                        verbose=0, warm_start=False))],
         verbose=False)
================================================================================
Training accuracy: 82.17%
Testing accuracy: 75.29%
================================================================================
Confusion matrix:
 [[1098    0    1 ...    1    6    0]
 [   0   30    0 ...    0    3    0]
 [   0    0  149 ...    0   13    0]
 ...
 [   0    0    0 ...   96    0    0]
 [   0    0    4 ...    0  571    2]
 [   1    0    0 ...    0   97  101]]
================================================================================
Classification report:
               precision    recall  f1-score   support

           0       0.95      0.93      0.94      1182
           1       0.83      0.65      0.73        46
           2       0.82      0.81      0.82       184
           3       0.84      0.96      0.89        48
           4       0.71      0.70      0.71       503
           5       0.99      0.91      0.95       225
           6       0.91      0.92      0.92       184
           7       1.00      1.00      1.00        58
           8       0.92      0.96      0.94       134
           9       0.93      1.00      0.96        37
          10       1.00      0.94      0.97       135
          11       0.95      0.97      0.96       310
          12       0.91      0.87      0.89       313
          13       0.98      0.91      0.95        58
          14       1.00      1.00      1.00        43
          15       0.98      0.98      0.98        46
          16       0.97      0.95      0.96        37
          17       0.52      0.64      0.57       814
          18       0.66      0.71      0.69       245
          19       0.96      0.84      0.89        91
          20       1.00      0.96      0.98        27
          21       0.64      0.62      0.63       107
          22       0.97      0.91      0.94       151
          23       0.96      0.93      0.94       298
          24       0.17      0.08      0.11       131
          25       0.51      0.43      0.47       129
          26       0.43      0.47      0.45       483
          27       0.58      0.68      0.63       138
          28       1.00      1.00      1.00        24
          29       0.96      0.91      0.94       151
          30       1.00      0.99      0.99        67
          31       0.98      1.00      0.99        60
          32       0.41      0.34      0.37       214
          33       0.98      0.87      0.92        46
          34       1.00      0.68      0.81        34
          35       0.11      0.09      0.09       105
          36       0.57      0.39      0.47       127
          37       0.77      0.55      0.64       196
          38       0.90      0.50      0.64        18
          39       0.88      0.97      0.92        29
          40       0.99      0.92      0.96       104
          41       0.64      0.86      0.74       662
          42       0.78      0.40      0.53       250

    accuracy                           0.75      8244
   macro avg       0.82      0.77      0.79      8244
weighted avg       0.76      0.75      0.75      8244

GradientBoosting

In [169]:
from sklearn.ensemble import GradientBoostingClassifier
run_classification(GradientBoostingClassifier(n_estimators=100, random_state=0), X_train, X_test, y_train, y_test)
Estimator: Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 GradientBoostingClassifier(ccp_alpha=0.0,
                                            criterion='friedman_mse', init=None,
                                            learning_rate=0.1, loss='deviance',
                                            max_depth=3, max_features=None,
                                            max_leaf_nodes=None,
                                            min_impurity_decrease=0.0,
                                            min_impurity_split=None,
                                            min_samples_leaf=1,
                                            min_samples_split=2,
                                            min_weight_fraction_leaf=0.0,
                                            n_estimators=100,
                                            n_iter_no_change=None,
                                            presort='deprecated',
                                            random_state=0, subsample=1.0,
                                            tol=0.0001, validation_fraction=0.1,
                                            verbose=0, warm_start=False))],
         verbose=False)
================================================================================
Training accuracy: 76.33%
Testing accuracy: 66.91%
================================================================================
Confusion matrix:
 [[914   1   4 ...   5   1   9]
 [  0  30   0 ...   0   6   0]
 [  0   0 146 ...   0  13   4]
 ...
 [  2   0   0 ...  96   0   0]
 [  6   0   4 ...   0 536   1]
 [  4   0   0 ...   0  88  93]]
================================================================================
Classification report:
               precision    recall  f1-score   support

           0       0.72      0.77      0.74      1182
           1       0.59      0.65      0.62        46
           2       0.79      0.79      0.79       184
           3       0.77      0.85      0.81        48
           4       0.69      0.66      0.67       503
           5       0.96      0.83      0.89       225
           6       0.90      0.86      0.88       184
           7       0.95      0.95      0.95        58
           8       0.71      0.86      0.78       134
           9       0.91      0.84      0.87        37
          10       0.97      0.92      0.94       135
          11       0.83      0.75      0.79       310
          12       0.83      0.71      0.76       313
          13       0.91      0.90      0.90        58
          14       0.93      0.93      0.93        43
          15       1.00      0.98      0.99        46
          16       0.58      0.38      0.46        37
          17       0.42      0.63      0.50       814
          18       0.67      0.67      0.67       245
          19       0.84      0.78      0.81        91
          20       0.56      0.56      0.56        27
          21       0.54      0.57      0.55       107
          22       0.87      0.81      0.84       151
          23       0.83      0.66      0.73       298
          24       0.20      0.10      0.13       131
          25       0.52      0.36      0.42       129
          26       0.45      0.39      0.41       483
          27       0.43      0.54      0.48       138
          28       0.84      0.88      0.86        24
          29       0.90      0.82      0.86       151
          30       0.89      0.93      0.91        67
          31       0.92      0.93      0.93        60
          32       0.43      0.35      0.38       214
          33       0.90      0.80      0.85        46
          34       0.62      0.44      0.52        34
          35       0.10      0.07      0.08       105
          36       0.58      0.40      0.47       127
          37       0.72      0.56      0.63       196
          38       0.82      0.50      0.62        18
          39       0.96      0.86      0.91        29
          40       0.84      0.92      0.88       104
          41       0.67      0.81      0.73       662
          42       0.78      0.37      0.50       250

    accuracy                           0.67      8244
   macro avg       0.73      0.68      0.70      8244
weighted avg       0.68      0.67      0.67      8244

XGBoosting

In [170]:
!pip install xgboost
Requirement already satisfied: xgboost in /usr/local/lib/python3.6/dist-packages (0.90)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from xgboost) (1.4.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from xgboost) (1.18.5)
In [171]:
from xgboost import XGBClassifier
run_classification(XGBClassifier(), X_train, X_test, y_train, y_test)
Estimator: Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, gamma=0, learning_rate=0.1,
                               max_delta_step=0, max_depth=3,
                               min_child_weight=1, missing=None,
                               n_estimators=100, n_jobs=1, nthread=None,
                               objective='multi:softprob', random_state=0,
                               reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                               seed=None, silent=None, subsample=1,
                               verbosity=1))],
         verbose=False)
================================================================================
Training accuracy: 69.08%
Testing accuracy: 61.90%
================================================================================
Confusion matrix:
 [[922   0   4 ...   2   0   8]
 [  2  26   0 ...   0   3   0]
 [  6   0 114 ...   0  14   4]
 ...
 [  9   0   0 ...  83   0   0]
 [  9   1   4 ...   0 524   0]
 [  9   0   0 ...   0  88  81]]
================================================================================
Classification report:
               precision    recall  f1-score   support

           0       0.61      0.78      0.68      1182
           1       0.96      0.57      0.71        46
           2       0.80      0.62      0.70       184
           3       0.92      0.69      0.79        48
           4       0.61      0.62      0.62       503
           5       0.89      0.76      0.82       225
           6       0.89      0.68      0.77       184
           7       0.79      0.90      0.84        58
           8       0.72      0.84      0.78       134
           9       0.84      1.00      0.91        37
          10       0.89      0.86      0.87       135
          11       0.70      0.56      0.63       310
          12       0.76      0.54      0.63       313
          13       1.00      0.67      0.80        58
          14       0.97      0.77      0.86        43
          15       0.80      0.98      0.88        46
          16       0.90      1.00      0.95        37
          17       0.36      0.74      0.48       814
          18       0.67      0.60      0.63       245
          19       0.72      0.58      0.64        91
          20       0.92      0.41      0.56        27
          21       0.76      0.42      0.54       107
          22       0.87      0.74      0.80       151
          23       0.83      0.53      0.65       298
          24       0.50      0.02      0.04       131
          25       0.95      0.15      0.26       129
          26       0.43      0.40      0.41       483
          27       0.55      0.42      0.48       138
          28       1.00      0.92      0.96        24
          29       0.96      0.58      0.72       151
          30       0.81      0.93      0.86        67
          31       0.92      0.93      0.93        60
          32       0.46      0.25      0.32       214
          33       0.94      0.74      0.83        46
          34       0.95      0.56      0.70        34
          35       0.11      0.05      0.07       105
          36       0.50      0.27      0.35       127
          37       0.72      0.45      0.56       196
          38       0.90      0.50      0.64        18
          39       0.92      0.76      0.83        29
          40       0.89      0.80      0.84       104
          41       0.65      0.79      0.71       662
          42       0.80      0.32      0.46       250

    accuracy                           0.62      8244
   macro avg       0.77      0.62      0.66      8244
weighted avg       0.66      0.62      0.61      8244

Bagging

In [172]:
from sklearn.ensemble import BaggingClassifier
run_classification(BaggingClassifier(n_estimators=10, random_state=0), X_train, X_test, y_train, y_test)
Estimator: Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 BaggingClassifier(base_estimator=None, bootstrap=True,
                                   bootstrap_features=False, max_features=1.0,
                                   max_samples=1.0, n_estimators=10,
                                   n_jobs=None, oob_score=False, random_state=0,
                                   verbose=0, warm_start=False))],
         verbose=False)
================================================================================
Training accuracy: 81.87%
Testing accuracy: 72.65%
================================================================================
Confusion matrix:
 [[994   0   2 ...   2   7   2]
 [  0  29   0 ...   0   3   0]
 [  2   0 147 ...   0  13   0]
 ...
 [  1   0   0 ...  92   0   0]
 [  1   0   4 ...   0 567   2]
 [  2   0   0 ...   0  97  95]]
================================================================================
Classification report:
               precision    recall  f1-score   support

           0       0.89      0.84      0.87      1182
           1       0.76      0.63      0.69        46
           2       0.80      0.80      0.80       184
           3       0.85      0.96      0.90        48
           4       0.68      0.70      0.69       503
           5       0.94      0.88      0.91       225
           6       0.90      0.90      0.90       184
           7       0.95      0.97      0.96        58
           8       0.84      0.92      0.88       134
           9       0.80      0.97      0.88        37
          10       0.98      0.94      0.96       135
          11       0.87      0.95      0.91       310
          12       0.85      0.83      0.84       313
          13       0.98      0.91      0.95        58
          14       1.00      1.00      1.00        43
          15       0.90      1.00      0.95        46
          16       0.92      0.89      0.90        37
          17       0.52      0.62      0.57       814
          18       0.66      0.70      0.68       245
          19       0.94      0.82      0.88        91
          20       1.00      1.00      1.00        27
          21       0.57      0.62      0.59       107
          22       0.90      0.85      0.88       151
          23       0.90      0.90      0.90       298
          24       0.17      0.09      0.12       131
          25       0.60      0.37      0.46       129
          26       0.41      0.44      0.43       483
          27       0.58      0.64      0.60       138
          28       0.96      1.00      0.98        24
          29       0.91      0.91      0.91       151
          30       0.93      0.97      0.95        67
          31       0.98      1.00      0.99        60
          32       0.40      0.35      0.37       214
          33       0.98      0.87      0.92        46
          34       0.88      0.68      0.77        34
          35       0.10      0.08      0.09       105
          36       0.60      0.40      0.48       127
          37       0.74      0.55      0.63       196
          38       1.00      0.50      0.67        18
          39       0.85      0.97      0.90        29
          40       0.92      0.88      0.90       104
          41       0.65      0.86      0.74       662
          42       0.76      0.38      0.51       250

    accuracy                           0.73      8244
   macro avg       0.79      0.76      0.76      8244
weighted avg       0.73      0.73      0.72      8244

Stacking

In [175]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import StackingClassifier

estimators = [('rf', RandomForestClassifier(n_estimators=100, random_state=42)), ('svr', make_pipeline(StandardScaler(with_mean=False), LinearSVC(random_state=42)))]

run_classification(StackingClassifier(estimators=estimators, final_estimator=DecisionTreeClassifier()), X_train, X_test, y_train, y_test)
/usr/local/lib/python3.6/dist-packages/sklearn/svm/_base.py:947: ConvergenceWarning:

Liblinear failed to converge, increase the number of iterations.

/usr/local/lib/python3.6/dist-packages/sklearn/svm/_base.py:947: ConvergenceWarning:

Liblinear failed to converge, increase the number of iterations.

/usr/local/lib/python3.6/dist-packages/sklearn/svm/_base.py:947: ConvergenceWarning:

Liblinear failed to converge, increase the number of iterations.

/usr/local/lib/python3.6/dist-packages/sklearn/svm/_base.py:947: ConvergenceWarning:

Liblinear failed to converge, increase the number of iterations.

/usr/local/lib/python3.6/dist-packages/sklearn/svm/_base.py:947: ConvergenceWarning:

Liblinear failed to converge, increase the number of iterations.

/usr/local/lib/python3.6/dist-packages/sklearn/svm/_base.py:947: ConvergenceWarning:

Liblinear failed to converge, increase the number of iterations.

Estimator: Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 StackingClassifier(cv=None,
                                    estimators=[('rf',
                                                 RandomForestClassifier(bootstrap=True,
                                                                        ccp_alpha=0.0,
                                                                        class_weight=None,
                                                                        criterion='gini',
                                                                        max_depth=None,
                                                                        max_features='auto',
                                                                        max_leaf_nodes=None,
                                                                        max_samples=None,
                                                                        min_impurity_decrease=0.0...
                                    final_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                                           class_weight=None,
                                                                           criterion='gini',
                                                                           max_depth=None,
                                                                           max_features=None,
                                                                           max_leaf_nodes=None,
                                                                           min_impurity_decrease=0.0,
                                                                           min_impurity_split=None,
                                                                           min_samples_leaf=1,
                                                                           min_samples_split=2,
                                                                           min_weight_fraction_leaf=0.0,
                                                                           presort='deprecated',
                                                                           random_state=None,
                                                                           splitter='best'),
                                    n_jobs=None, passthrough=False,
                                    stack_method='auto', verbose=0))],
         verbose=False)
================================================================================
Training accuracy: 79.17%
Testing accuracy: 73.41%
================================================================================
Confusion matrix:
 [[1047    0    4 ...    0    0    0]
 [   0   33    0 ...    0    4    0]
 [   1    0  147 ...    0   13    0]
 ...
 [   1    0    0 ...   90    0    0]
 [   1    0    4 ...    0  551    0]
 [   8    0    0 ...    0   88   92]]
================================================================================
Classification report:
               precision    recall  f1-score   support

           0       0.91      0.89      0.90      1182
           1       0.72      0.72      0.72        46
           2       0.74      0.80      0.77       184
           3       0.85      0.96      0.90        48
           4       0.63      0.75      0.68       503
           5       0.95      0.92      0.94       225
           6       0.82      0.92      0.86       184
           7       0.98      1.00      0.99        58
           8       0.92      0.90      0.91       134
           9       0.82      1.00      0.90        37
          10       0.91      0.95      0.93       135
          11       0.95      0.95      0.95       310
          12       0.85      0.88      0.86       313
          13       0.93      0.95      0.94        58
          14       0.98      0.95      0.96        43
          15       0.96      0.96      0.96        46
          16       0.97      0.97      0.97        37
          17       0.51      0.64      0.57       814
          18       0.69      0.69      0.69       245
          19       0.87      0.82      0.85        91
          20       0.81      0.96      0.88        27
          21       0.64      0.61      0.62       107
          22       0.97      0.92      0.95       151
          23       0.97      0.91      0.94       298
          24       0.22      0.22      0.22       131
          25       0.49      0.43      0.46       129
          26       0.48      0.34      0.40       483
          27       0.59      0.54      0.56       138
          28       0.92      0.96      0.94        24
          29       0.90      0.91      0.91       151
          30       0.93      1.00      0.96        67
          31       0.98      1.00      0.99        60
          32       0.39      0.24      0.30       214
          33       1.00      0.80      0.89        46
          34       0.68      0.68      0.68        34
          35       0.21      0.16      0.18       105
          36       0.36      0.40      0.38       127
          37       0.70      0.56      0.62       196
          38       1.00      0.50      0.67        18
          39       1.00      0.90      0.95        29
          40       0.99      0.87      0.92       104
          41       0.68      0.83      0.75       662
          42       0.77      0.37      0.50       250

    accuracy                           0.73      8244
   macro avg       0.78      0.76      0.77      8244
weighted avg       0.74      0.73      0.73      8244

Voting

In [176]:
from sklearn.ensemble import VotingClassifier

estimators = [('rf', RandomForestClassifier(n_estimators=100, random_state=42)), ('dtc', DecisionTreeClassifier(random_state=42)), ('lsvc', LinearSVC(random_state=42))]

run_classification(VotingClassifier(estimators=estimators, voting='hard'), X_train, X_test, y_train, y_test)
Estimator: Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 VotingClassifier(estimators=[('rf',
                                               RandomForestClassifier(bootstrap=True,
                                                                      ccp_alpha=0.0,
                                                                      class_weight=None,
                                                                      criterion='gini',
                                                                      max_depth=None,
                                                                      max_features='auto',
                                                                      max_leaf_nodes=None,
                                                                      max_samples=None,
                                                                      min_impurity_decrease=0.0,
                                                                      min_impur...
                                                                      min_weight_fraction_leaf=0.0,
                                                                      presort='deprecated',
                                                                      random_state=42,
                                                                      splitter='best')),
                                              ('lsvc',
                                               LinearSVC(C=1.0,
                                                         class_weight=None,
                                                         dual=True,
                                                         fit_intercept=True,
                                                         intercept_scaling=1,
                                                         loss='squared_hinge',
                                                         max_iter=1000,
                                                         multi_class='ovr',
                                                         penalty='l2',
                                                         random_state=42,
                                                         tol=0.0001,
                                                         verbose=0))],
                                  flatten_transform=True, n_jobs=None,
                                  voting='hard', weights=None))],
         verbose=False)
================================================================================
Training accuracy: 82.14%
Testing accuracy: 75.41%
================================================================================
Confusion matrix:
 [[1092    0    1 ...    1    5    1]
 [   0   31    0 ...    0    3    0]
 [   0    0  152 ...    0   13    0]
 ...
 [   0    0    0 ...   91    0    0]
 [   0    0    4 ...    0  571    2]
 [   1    0    0 ...    0   97  101]]
================================================================================
Classification report:
               precision    recall  f1-score   support

           0       0.94      0.92      0.93      1182
           1       0.65      0.67      0.66        46
           2       0.77      0.83      0.80       184
           3       0.88      0.96      0.92        48
           4       0.67      0.74      0.70       503
           5       0.96      0.92      0.94       225
           6       0.92      0.92      0.92       184
           7       1.00      1.00      1.00        58
           8       0.88      0.96      0.92       134
           9       0.82      1.00      0.90        37
          10       1.00      0.94      0.97       135
          11       0.93      0.97      0.95       310
          12       0.90      0.87      0.88       313
          13       0.96      0.91      0.94        58
          14       0.98      1.00      0.99        43
          15       0.96      0.98      0.97        46
          16       0.93      1.00      0.96        37
          17       0.51      0.66      0.58       814
          18       0.73      0.71      0.72       245
          19       0.93      0.82      0.87        91
          20       1.00      0.89      0.94        27
          21       0.60      0.61      0.60       107
          22       0.94      0.91      0.93       151
          23       0.94      0.93      0.93       298
          24       0.17      0.08      0.11       131
          25       0.74      0.37      0.49       129
          26       0.46      0.46      0.46       483
          27       0.59      0.64      0.62       138
          28       1.00      1.00      1.00        24
          29       0.96      0.91      0.94       151
          30       0.99      1.00      0.99        67
          31       0.98      1.00      0.99        60
          32       0.46      0.34      0.39       214
          33       0.95      0.87      0.91        46
          34       1.00      0.68      0.81        34
          35       0.08      0.05      0.06       105
          36       0.67      0.39      0.49       127
          37       0.80      0.56      0.65       196
          38       1.00      0.50      0.67        18
          39       0.88      0.97      0.92        29
          40       0.99      0.88      0.93       104
          41       0.64      0.86      0.74       662
          42       0.75      0.40      0.53       250

    accuracy                           0.75      8244
   macro avg       0.81      0.77      0.78      8244
weighted avg       0.76      0.75      0.75      8244

In [181]:
from sklearn.ensemble import VotingClassifier

estimators = [('rf', RandomForestClassifier(n_estimators=100, random_state=42)), ('dtc', DecisionTreeClassifier(random_state=42)), ('lsvc', SVC(kernel='linear',probability=True))]

run_classification(VotingClassifier(estimators=estimators, voting='soft'), X_train, X_test, y_train, y_test)
Estimator: Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 VotingClassifier(estimators=[('rf',
                                               RandomForestClassifier(bootstrap=True,
                                                                      ccp_alpha=0.0,
                                                                      class_weight=None,
                                                                      criterion='gini',
                                                                      max_depth=None,
                                                                      max_features='auto',
                                                                      max_leaf_nodes=None,
                                                                      max_samples=None,
                                                                      min_impurity_decrease=0.0,
                                                                      min_impur...
                                                                      presort='deprecated',
                                                                      random_state=42,
                                                                      splitter='best')),
                                              ('lsvc',
                                               SVC(C=1.0, break_ties=False,
                                                   cache_size=200,
                                                   class_weight=None, coef0=0.0,
                                                   decision_function_shape='ovr',
                                                   degree=3, gamma='scale',
                                                   kernel='linear', max_iter=-1,
                                                   probability=True,
                                                   random_state=None,
                                                   shrinking=True, tol=0.001,
                                                   verbose=False))],
                                  flatten_transform=True, n_jobs=None,
                                  voting='soft', weights=None))],
         verbose=False)
================================================================================
Training accuracy: 81.88%
Testing accuracy: 74.26%
================================================================================
Confusion matrix:
 [[991   0   3 ...   2   5   3]
 [  0  29   0 ...   0   3   0]
 [  2   0 148 ...   0  13   0]
 ...
 [  0   0   0 ...  91   0   0]
 [  1   0   4 ...   0 567   2]
 [  1   0   0 ...   0  97 101]]
================================================================================
Classification report:
               precision    recall  f1-score   support

           0       0.94      0.84      0.88      1182
           1       0.78      0.63      0.70        46
           2       0.86      0.80      0.83       184
           3       0.98      0.90      0.93        48
           4       0.70      0.73      0.71       503
           5       0.94      0.92      0.93       225
           6       0.93      0.92      0.93       184
           7       0.95      0.97      0.96        58
           8       0.90      0.96      0.93       134
           9       0.82      1.00      0.90        37
          10       0.98      0.94      0.96       135
          11       0.87      0.96      0.91       310
          12       0.86      0.88      0.87       313
          13       0.96      0.91      0.94        58
          14       0.93      0.98      0.95        43
          15       0.90      0.98      0.94        46
          16       0.88      1.00      0.94        37
          17       0.50      0.69      0.58       814
          18       0.74      0.71      0.72       245
          19       0.83      0.81      0.82        91
          20       0.86      0.93      0.89        27
          21       0.76      0.61      0.68       107
          22       0.94      0.91      0.92       151
          23       0.91      0.92      0.91       298
          24       0.26      0.07      0.11       131
          25       0.62      0.37      0.47       129
          26       0.42      0.48      0.44       483
          27       0.55      0.65      0.60       138
          28       0.92      1.00      0.96        24
          29       0.98      0.91      0.94       151
          30       0.93      1.00      0.96        67
          31       0.98      1.00      0.99        60
          32       0.43      0.34      0.38       214
          33       0.93      0.87      0.90        46
          34       0.85      0.68      0.75        34
          35       0.11      0.06      0.08       105
          36       0.69      0.39      0.49       127
          37       0.81      0.56      0.66       196
          38       1.00      0.50      0.67        18
          39       0.96      0.90      0.93        29
          40       0.98      0.88      0.92       104
          41       0.64      0.86      0.73       662
          42       0.80      0.40      0.54       250

    accuracy                           0.74      8244
   macro avg       0.80      0.76      0.77      8244
weighted avg       0.75      0.74      0.74      8244

Deep Neural Networks

In [182]:
# Load the augmented data from pickle file 
with open('/content/Interim_data.pkl','rb') as f:
    clean_data_DL = pickle.load(f)
In [183]:
clean_data_DL.isnull().sum()
Out[183]:
Caller                0
Assignment group      0
language              0
Final_Text          197
dtype: int64
In [184]:
clean_data_DL['Final_Text'] = clean_data_DL['Final_Text'].replace(np.nan, '', regex=True)
In [185]:
clean_data_DL.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 27478 entries, 0 to 104
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Caller            27478 non-null  object
 1   Assignment group  27478 non-null  object
 2   language          27478 non-null  object
 3   Final_Text        27478 non-null  object
dtypes: object(4)
memory usage: 1.0+ MB
In [186]:
# Import label encoder 
from sklearn import preprocessing 
  
# label_encoder object knows how to understand word labels. 
label_encoder = preprocessing.LabelEncoder() 
  
# Encode labels in column 'species'. 
clean_data_DL['Assignment group LabelEncoded']= label_encoder.fit_transform(clean_data_DL['Assignment group']) 
  
clean_data_DL['Assignment group LabelEncoded'].unique()
Out[186]:
array([17, 25, 18,  4, 35, 26, 24, 32, 21,  1,  8, 12, 27, 13,  6, 23,  2,
       22, 29,  5, 42, 36, 19, 34, 37, 40, 41, 10,  3,  7,  9, 11, 14, 15,
       16, 20, 28, 30, 31, 33, 38, 39,  0])
In [187]:
onehot_encoded_dict = dict(zip(clean_data_DL['Assignment group'].unique(), clean_data_DL['Assignment group LabelEncoded'].unique()))
len(onehot_encoded_dict)
Out[187]:
43
In [188]:
# Splitting Train Test 
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(clean_data_DL['Final_Text'], clean_data_DL['Assignment group LabelEncoded'], test_size=0.3, random_state = 0, stratify=clean_data_DL['Assignment group LabelEncoded'])
print('\033[1mShape of the training set:\033[0m', X_train.shape, X_test.shape)
print('\033[1mShape of the test set:\033[0m', y_train.shape, y_test.shape)
Shape of the training set: (19234,) (8244,)
Shape of the test set: (19234,) (8244,)

Create checkpoints function

In [189]:
#Path where you want to save the weights, model and checkpoints
model_path = "Weights/"
%mkdir Weights

# Define model callbacks
def call_backs(name):
    early_stopping = EarlyStopping(monitor='val_loss', mode='min', min_delta=0.01, patience=3)
    model_checkpoint =  ModelCheckpoint(model_path + name + '_epoch{epoch:02d}_loss{val_loss:.4f}.h5',
                                                               monitor='val_loss',
                                                               verbose=1,
                                                               save_best_only=True,
                                                               save_weights_only=False,
                                                               mode='min',
                                                               period=1)
    return [model_checkpoint, early_stopping]
In [190]:
# Function to build Neural Network
def Build_Model_DNN_Text(shape, nClasses, dropout=0.3):
    """
    buildModel_DNN_Tex(shape, nClasses,dropout)
    Build Deep neural networks Model for text classification
    Shape is input feature space
    nClasses is number of classes
    """
    model = Sequential()
    node = 512 # number of nodes
    nLayers = 4 # number of  hidden layer
    model.add(Dense(node,input_dim=shape,activation='relu'))
    model.add(Dropout(dropout))
    for i in range(0,nLayers):
        model.add(Dense(node,input_dim=node,activation='relu'))
        model.add(Dropout(dropout))
        model.add(BatchNormalization())
    model.add(Dense(nClasses, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    print(model.summary())
    return model
In [191]:
Tfidf_vect = TfidfVectorizer(max_features=2000)
Tfidf_vect.fit(clean_data_DL.Final_Text.astype(str))
X_train_tfidf = Tfidf_vect.transform(X_train)
X_test_tfidf = Tfidf_vect.transform(X_test)

# Instantiate the network
model_DNN = Build_Model_DNN_Text(X_train_tfidf.shape[1], 43)
WARNING:tensorflow:From /tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 512)               1024512   
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 512)               2048      
_________________________________________________________________
dense_3 (Dense)              (None, 512)               262656    
_________________________________________________________________
dropout_3 (Dropout)          (None, 512)               0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 512)               2048      
_________________________________________________________________
dense_4 (Dense)              (None, 512)               262656    
_________________________________________________________________
dropout_4 (Dropout)          (None, 512)               0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 512)               2048      
_________________________________________________________________
dense_5 (Dense)              (None, 512)               262656    
_________________________________________________________________
dropout_5 (Dropout)          (None, 512)               0         
_________________________________________________________________
batch_normalization_4 (Batch (None, 512)               2048      
_________________________________________________________________
dense_6 (Dense)              (None, 43)                22059     
=================================================================
Total params: 2,105,387
Trainable params: 2,101,291
Non-trainable params: 4,096
_________________________________________________________________
None
In [192]:
run_classification(model_DNN, X_train_tfidf, X_test_tfidf, y_train, y_test,pipelineRequired = False,isDeepModel=True, arch_name='DNN')

'''model_DNN.fit(X_train_tfidf, y_train,
                              validation_data=(X_test_tfidf, y_test),
                              callbacks=call_backs("NN"),
                              epochs=10,
                              batch_size=128,
                              verbose=2)
predicted = model_DNN.predict(X_test_tfidf)'''
WARNING:tensorflow:From /tensorflow-1.15.2/python3.6/keras/backend/tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

Train on 19234 samples, validate on 8244 samples
Epoch 1/25
19234/19234 [==============================] - 4s 211us/step - loss: 3.0467 - accuracy: 0.2672 - val_loss: 3.1714 - val_accuracy: 0.1659

Epoch 00001: val_loss improved from inf to 3.17138, saving model to Weights/DNN_epoch01_loss3.1714.h5
Epoch 2/25
19234/19234 [==============================] - 2s 92us/step - loss: 1.8600 - accuracy: 0.4865 - val_loss: 2.5431 - val_accuracy: 0.3070

Epoch 00002: val_loss improved from 3.17138 to 2.54312, saving model to Weights/DNN_epoch02_loss2.5431.h5
Epoch 3/25
19234/19234 [==============================] - 2s 95us/step - loss: 1.3351 - accuracy: 0.6114 - val_loss: 1.6218 - val_accuracy: 0.5118

Epoch 00003: val_loss improved from 2.54312 to 1.62179, saving model to Weights/DNN_epoch03_loss1.6218.h5
Epoch 4/25
19234/19234 [==============================] - 2s 96us/step - loss: 1.0283 - accuracy: 0.6861 - val_loss: 0.9500 - val_accuracy: 0.7054

Epoch 00004: val_loss improved from 1.62179 to 0.95000, saving model to Weights/DNN_epoch04_loss0.9500.h5
Epoch 5/25
19234/19234 [==============================] - 2s 96us/step - loss: 0.8687 - accuracy: 0.7253 - val_loss: 0.8137 - val_accuracy: 0.7273

Epoch 00005: val_loss improved from 0.95000 to 0.81367, saving model to Weights/DNN_epoch05_loss0.8137.h5
Epoch 6/25
19234/19234 [==============================] - 2s 94us/step - loss: 0.7729 - accuracy: 0.7502 - val_loss: 0.8000 - val_accuracy: 0.7344

Epoch 00006: val_loss improved from 0.81367 to 0.80001, saving model to Weights/DNN_epoch06_loss0.8000.h5
Epoch 7/25
19234/19234 [==============================] - 2s 94us/step - loss: 0.7236 - accuracy: 0.7598 - val_loss: 0.7394 - val_accuracy: 0.7453

Epoch 00007: val_loss improved from 0.80001 to 0.73937, saving model to Weights/DNN_epoch07_loss0.7394.h5
Epoch 8/25
19234/19234 [==============================] - 2s 95us/step - loss: 0.6781 - accuracy: 0.7653 - val_loss: 0.7533 - val_accuracy: 0.7325

Epoch 00008: val_loss did not improve from 0.73937
Epoch 9/25
19234/19234 [==============================] - 2s 94us/step - loss: 0.6477 - accuracy: 0.7704 - val_loss: 0.7142 - val_accuracy: 0.7436

Epoch 00009: val_loss improved from 0.73937 to 0.71420, saving model to Weights/DNN_epoch09_loss0.7142.h5
Epoch 10/25
19234/19234 [==============================] - 2s 92us/step - loss: 0.6246 - accuracy: 0.7744 - val_loss: 0.7260 - val_accuracy: 0.7405

Epoch 00010: val_loss did not improve from 0.71420
Epoch 11/25
19234/19234 [==============================] - 2s 93us/step - loss: 0.6105 - accuracy: 0.7790 - val_loss: 0.7140 - val_accuracy: 0.7470

Epoch 00011: val_loss improved from 0.71420 to 0.71404, saving model to Weights/DNN_epoch11_loss0.7140.h5
Epoch 12/25
19234/19234 [==============================] - 2s 92us/step - loss: 0.6013 - accuracy: 0.7797 - val_loss: 0.6893 - val_accuracy: 0.7461

Epoch 00012: val_loss improved from 0.71404 to 0.68931, saving model to Weights/DNN_epoch12_loss0.6893.h5
Epoch 13/25
19234/19234 [==============================] - 2s 91us/step - loss: 0.5872 - accuracy: 0.7802 - val_loss: 0.7027 - val_accuracy: 0.7467

Epoch 00013: val_loss did not improve from 0.68931
Epoch 14/25
19234/19234 [==============================] - 2s 96us/step - loss: 0.5781 - accuracy: 0.7840 - val_loss: 0.7099 - val_accuracy: 0.7415

Epoch 00014: val_loss did not improve from 0.68931
Epoch 15/25
19234/19234 [==============================] - 2s 93us/step - loss: 0.5640 - accuracy: 0.7839 - val_loss: 0.7087 - val_accuracy: 0.7473

Epoch 00015: val_loss did not improve from 0.68931
Estimator: <keras.engine.sequential.Sequential object at 0x7fca7fcca898>
================================================================================
Training accuracy: 80.73%
Testing accuracy: 74.73%
================================================================================
Confusion matrix:
 [[1021    0    2 ...    0    6    4]
 [   0   28    0 ...    0    3    0]
 [   0    0  152 ...    0   13    0]
 ...
 [   1    0    0 ...   95    0    0]
 [   0    0    4 ...    0  565    0]
 [   6    0    0 ...    0   98   96]]
================================================================================
Classification report:
               precision    recall  f1-score   support

           0       0.97      0.86      0.91      1182
           1       1.00      0.61      0.76        46
           2       0.72      0.83      0.77       184
           3       1.00      0.92      0.96        48
           4       0.61      0.81      0.69       503
           5       0.98      0.92      0.94       225
           6       0.82      0.93      0.87       184
           7       0.98      1.00      0.99        58
           8       0.73      1.00      0.85       134
           9       0.93      1.00      0.96        37
          10       0.87      0.97      0.92       135
          11       0.89      0.99      0.93       310
          12       0.86      0.85      0.86       313
          13       0.98      0.91      0.95        58
          14       1.00      1.00      1.00        43
          15       0.94      1.00      0.97        46
          16       0.95      1.00      0.97        37
          17       0.53      0.64      0.58       814
          18       0.84      0.73      0.78       245
          19       0.88      0.88      0.88        91
          20       0.93      0.96      0.95        27
          21       0.50      0.76      0.60       107
          22       0.94      0.91      0.93       151
          23       0.90      0.92      0.91       298
          24       0.12      0.01      0.01       131
          25       0.56      0.42      0.48       129
          26       0.54      0.41      0.47       483
          27       0.43      0.81      0.57       138
          28       1.00      1.00      1.00        24
          29       0.95      0.82      0.88       151
          30       0.96      0.97      0.96        67
          31       0.98      1.00      0.99        60
          32       0.59      0.31      0.41       214
          33       0.95      0.87      0.91        46
          34       0.82      0.68      0.74        34
          35       0.14      0.10      0.11       105
          36       0.60      0.39      0.47       127
          37       0.92      0.51      0.65       196
          38       0.82      0.50      0.62        18
          39       0.85      1.00      0.92        29
          40       1.00      0.91      0.95       104
          41       0.63      0.85      0.73       662
          42       0.79      0.38      0.52       250

    accuracy                           0.75      8244
   macro avg       0.80      0.78      0.78      8244
weighted avg       0.75      0.75      0.74      8244

Out[192]:
'model_DNN.fit(X_train_tfidf, y_train,\n                              validation_data=(X_test_tfidf, y_test),\n                              callbacks=call_backs("NN"),\n                              epochs=10,\n                              batch_size=128,\n                              verbose=2)\npredicted = model_DNN.predict(X_test_tfidf)'

Extract Glove Embeddings

In [193]:
#download the glove embedding zip file from http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
from zipfile import ZipFile
# Check if it is already extracted else Open the zipped file as readonly
if not os.path.isfile('glove.6B/glove.6B.200d.txt'):
    #glove_embeddings = 'glove.6B.zip'
    glove_embeddings = '/content/drive/MyDrive/Capstone/glove.6B.zip'
    with ZipFile(glove_embeddings, 'r') as archive:
        archive.extractall('glove.6B')

# List the files under extracted folder
os.listdir('glove.6B')
Out[193]:
['glove.6B.300d.txt',
 'glove.6B.100d.txt',
 'glove.6B.50d.txt',
 'glove.6B.200d.txt']

Convolutional Neural Networks (CNN)

In [194]:
#gloveFileName = 'glove.6B/glove.6B.200d.txt'
gloveFileName = '/content/glove.6B/glove.6B.200d.txt'
MAX_SEQUENCE_LENGTH = 500
EMBEDDING_DIM=200
MAX_NB_WORDS=75000

# Function to generate Embedding
def loadData_Tokenizer(X_train, X_test,filename):
    np.random.seed(7)
    text = np.concatenate((X_train, X_test), axis=0)
    text = np.array(text)
    tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
    tokenizer.fit_on_texts(text)
    sequences = tokenizer.texts_to_sequences(text)
    word_index = tokenizer.word_index
    text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
    print('Found %s unique tokens.' % len(word_index))
    indices = np.arange(text.shape[0])
    # np.random.shuffle(indices)
    text = text[indices]
    print(text.shape)
    X_train = text[0:len(X_train), ]
    X_test = text[len(X_train):, ]
    embeddings_index = {}
    f = open(filename, encoding="utf8")
    for line in f:
        values = line.split()
        word = values[0]
        try:
            coefs = np.asarray(values[1:], dtype='float32')
        except:
            pass
        embeddings_index[word] = coefs
    f.close()
    print('Total %s word vectors.' % len(embeddings_index))
    return (X_train, X_test, word_index,embeddings_index)


embedding_matrix = []

def buildEmbed_matrices(word_index,embedding_dim):
    embedding_matrix = np.random.random((len(word_index) + 1, embedding_dim))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            if len(embedding_matrix[i]) !=len(embedding_vector):
                print("could not broadcast input array from shape",str(len(embedding_matrix[i])), "into shape",str(len(embedding_vector)),
                      " Please make sure your"" EMBEDDING_DIM is equal to embedding_vector file ,GloVe,")
                exit(1)
            embedding_matrix[i] = embedding_vector
    return embedding_matrix
In [195]:
# Generate Glove embedded datasets
X_train_Glove, X_test_Glove, word_index, embeddings_index = loadData_Tokenizer(X_train,X_test,gloveFileName)
embedding_matrix = buildEmbed_matrices(word_index,EMBEDDING_DIM)
Found 14223 unique tokens.
(27478, 500)
Total 400001 word vectors.
In [196]:
def Build_Model_CNN_Text(word_index, embeddings_matrix, nclasses,dropout=0.5):
    """
        def buildModel_CNN(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):
        word_index in word index ,
        embeddings_index is embeddings index, look at data_helper.py
        nClasses is number of classes,
        MAX_SEQUENCE_LENGTH is maximum lenght of text sequences,
        EMBEDDING_DIM is an int value for dimention of word embedding look at data_helper.py
    """
    model = Sequential()
    embedding_layer = Embedding(len(word_index) + 1,
                                EMBEDDING_DIM,
                                weights=[embeddings_matrix],
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=True)
    # applying a more complex convolutional approach
    convs = []
    filter_sizes = []
    layer = 5
    print("Filter  ",layer)
    for fl in range(0,layer):
        filter_sizes.append((fl+2))
    node = 128
    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)
    for fsz in filter_sizes:
        l_conv = Conv1D(node, kernel_size=fsz, activation='relu')(embedded_sequences)
        l_pool = MaxPooling1D(5)(l_conv)
        #l_pool = Dropout(0.25)(l_pool)
        convs.append(l_pool)
    l_merge = Concatenate(axis=1)(convs)
    l_cov1 = Conv1D(node, 5, activation='relu')(l_merge)
    l_cov1 = Dropout(dropout)(l_cov1)
    l_batch1 = BatchNormalization()(l_cov1)
    l_pool1 = MaxPooling1D(5)(l_batch1)
    l_cov2 = Conv1D(node, 5, activation='relu')(l_pool1)
    l_cov2 = Dropout(dropout)(l_cov2)
    l_batch2 = BatchNormalization()(l_cov2)
    l_pool2 = MaxPooling1D(30)(l_batch2)
    l_flat = Flatten()(l_pool2)
    l_dense = Dense(1024, activation='relu')(l_flat)
    l_dense = Dropout(dropout)(l_dense)
    l_dense = Dense(512, activation='relu')(l_dense)
    l_dense = Dropout(dropout)(l_dense)
    preds = Dense(nclasses, activation='softmax')(l_dense)
    model = Model(sequence_input, preds)
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    
    print(model.summary())
    return model
In [197]:
# Train the network and run classification
model_CNN = Build_Model_CNN_Text(word_index,embedding_matrix, 43)
run_classification(model_CNN, X_train_Glove, X_test_Glove, y_train, y_test,pipelineRequired = False,isDeepModel=True, arch_name='CNN')
Filter   5
WARNING:tensorflow:From /tensorflow-1.15.2/python3.6/keras/backend/tensorflow_backend.py:4070: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 500)          0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 500, 200)     2844800     input_1[0][0]                    
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 499, 128)     51328       embedding_1[0][0]                
__________________________________________________________________________________________________
conv1d_2 (Conv1D)               (None, 498, 128)     76928       embedding_1[0][0]                
__________________________________________________________________________________________________
conv1d_3 (Conv1D)               (None, 497, 128)     102528      embedding_1[0][0]                
__________________________________________________________________________________________________
conv1d_4 (Conv1D)               (None, 496, 128)     128128      embedding_1[0][0]                
__________________________________________________________________________________________________
conv1d_5 (Conv1D)               (None, 495, 128)     153728      embedding_1[0][0]                
__________________________________________________________________________________________________
max_pooling1d_1 (MaxPooling1D)  (None, 99, 128)      0           conv1d_1[0][0]                   
__________________________________________________________________________________________________
max_pooling1d_2 (MaxPooling1D)  (None, 99, 128)      0           conv1d_2[0][0]                   
__________________________________________________________________________________________________
max_pooling1d_3 (MaxPooling1D)  (None, 99, 128)      0           conv1d_3[0][0]                   
__________________________________________________________________________________________________
max_pooling1d_4 (MaxPooling1D)  (None, 99, 128)      0           conv1d_4[0][0]                   
__________________________________________________________________________________________________
max_pooling1d_5 (MaxPooling1D)  (None, 99, 128)      0           conv1d_5[0][0]                   
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 495, 128)     0           max_pooling1d_1[0][0]            
                                                                 max_pooling1d_2[0][0]            
                                                                 max_pooling1d_3[0][0]            
                                                                 max_pooling1d_4[0][0]            
                                                                 max_pooling1d_5[0][0]            
__________________________________________________________________________________________________
conv1d_6 (Conv1D)               (None, 491, 128)     82048       concatenate_1[0][0]              
__________________________________________________________________________________________________
dropout_6 (Dropout)             (None, 491, 128)     0           conv1d_6[0][0]                   
__________________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, 491, 128)     512         dropout_6[0][0]                  
__________________________________________________________________________________________________
max_pooling1d_6 (MaxPooling1D)  (None, 98, 128)      0           batch_normalization_5[0][0]      
__________________________________________________________________________________________________
conv1d_7 (Conv1D)               (None, 94, 128)      82048       max_pooling1d_6[0][0]            
__________________________________________________________________________________________________
dropout_7 (Dropout)             (None, 94, 128)      0           conv1d_7[0][0]                   
__________________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, 94, 128)      512         dropout_7[0][0]                  
__________________________________________________________________________________________________
max_pooling1d_7 (MaxPooling1D)  (None, 3, 128)       0           batch_normalization_6[0][0]      
__________________________________________________________________________________________________
flatten_1 (Flatten)             (None, 384)          0           max_pooling1d_7[0][0]            
__________________________________________________________________________________________________
dense_7 (Dense)                 (None, 1024)         394240      flatten_1[0][0]                  
__________________________________________________________________________________________________
dropout_8 (Dropout)             (None, 1024)         0           dense_7[0][0]                    
__________________________________________________________________________________________________
dense_8 (Dense)                 (None, 512)          524800      dropout_8[0][0]                  
__________________________________________________________________________________________________
dropout_9 (Dropout)             (None, 512)          0           dense_8[0][0]                    
__________________________________________________________________________________________________
dense_9 (Dense)                 (None, 43)           22059       dropout_9[0][0]                  
==================================================================================================
Total params: 4,463,659
Trainable params: 4,463,147
Non-trainable params: 512
__________________________________________________________________________________________________
None
Train on 19234 samples, validate on 8244 samples
Epoch 1/25
19234/19234 [==============================] - 25s 1ms/step - loss: 3.4089 - accuracy: 0.1399 - val_loss: 3.4903 - val_accuracy: 0.1436

Epoch 00001: val_loss improved from inf to 3.49034, saving model to Weights/CNN_epoch01_loss3.4903.h5
Epoch 2/25
19234/19234 [==============================] - 17s 895us/step - loss: 3.1897 - accuracy: 0.1716 - val_loss: 3.3017 - val_accuracy: 0.1689

Epoch 00002: val_loss improved from 3.49034 to 3.30168, saving model to Weights/CNN_epoch02_loss3.3017.h5
Epoch 3/25
19234/19234 [==============================] - 17s 890us/step - loss: 2.9973 - accuracy: 0.2072 - val_loss: 3.0111 - val_accuracy: 0.2226

Epoch 00003: val_loss improved from 3.30168 to 3.01114, saving model to Weights/CNN_epoch03_loss3.0111.h5
Epoch 4/25
19234/19234 [==============================] - 17s 888us/step - loss: 2.8013 - accuracy: 0.2279 - val_loss: 2.8811 - val_accuracy: 0.2310

Epoch 00004: val_loss improved from 3.01114 to 2.88113, saving model to Weights/CNN_epoch04_loss2.8811.h5
Epoch 5/25
19234/19234 [==============================] - 17s 891us/step - loss: 2.5974 - accuracy: 0.2810 - val_loss: 2.6307 - val_accuracy: 0.3269

Epoch 00005: val_loss improved from 2.88113 to 2.63068, saving model to Weights/CNN_epoch05_loss2.6307.h5
Epoch 6/25
19234/19234 [==============================] - 17s 891us/step - loss: 2.3503 - accuracy: 0.3344 - val_loss: 2.3924 - val_accuracy: 0.3575

Epoch 00006: val_loss improved from 2.63068 to 2.39235, saving model to Weights/CNN_epoch06_loss2.3924.h5
Epoch 7/25
19234/19234 [==============================] - 17s 890us/step - loss: 2.1177 - accuracy: 0.3760 - val_loss: 2.1706 - val_accuracy: 0.3850

Epoch 00007: val_loss improved from 2.39235 to 2.17057, saving model to Weights/CNN_epoch07_loss2.1706.h5
Epoch 8/25
19234/19234 [==============================] - 17s 886us/step - loss: 1.9166 - accuracy: 0.4177 - val_loss: 2.1364 - val_accuracy: 0.3838

Epoch 00008: val_loss improved from 2.17057 to 2.13643, saving model to Weights/CNN_epoch08_loss2.1364.h5
Epoch 9/25
19234/19234 [==============================] - 17s 887us/step - loss: 1.7982 - accuracy: 0.4577 - val_loss: 1.9098 - val_accuracy: 0.4563

Epoch 00009: val_loss improved from 2.13643 to 1.90978, saving model to Weights/CNN_epoch09_loss1.9098.h5
Epoch 10/25
19234/19234 [==============================] - 17s 883us/step - loss: 1.6649 - accuracy: 0.4870 - val_loss: 1.7808 - val_accuracy: 0.4612

Epoch 00010: val_loss improved from 1.90978 to 1.78079, saving model to Weights/CNN_epoch10_loss1.7808.h5
Epoch 11/25
19234/19234 [==============================] - 17s 884us/step - loss: 1.5615 - accuracy: 0.5143 - val_loss: 1.7504 - val_accuracy: 0.4586

Epoch 00011: val_loss improved from 1.78079 to 1.75039, saving model to Weights/CNN_epoch11_loss1.7504.h5
Epoch 12/25
19234/19234 [==============================] - 17s 883us/step - loss: 1.4827 - accuracy: 0.5393 - val_loss: 1.6248 - val_accuracy: 0.4911

Epoch 00012: val_loss improved from 1.75039 to 1.62476, saving model to Weights/CNN_epoch12_loss1.6248.h5
Epoch 13/25
19234/19234 [==============================] - 17s 882us/step - loss: 1.3846 - accuracy: 0.5678 - val_loss: 1.6083 - val_accuracy: 0.5132

Epoch 00013: val_loss improved from 1.62476 to 1.60831, saving model to Weights/CNN_epoch13_loss1.6083.h5
Epoch 14/25
19234/19234 [==============================] - 17s 883us/step - loss: 1.3194 - accuracy: 0.5895 - val_loss: 1.6047 - val_accuracy: 0.5118

Epoch 00014: val_loss improved from 1.60831 to 1.60473, saving model to Weights/CNN_epoch14_loss1.6047.h5
Epoch 15/25
19234/19234 [==============================] - 17s 878us/step - loss: 1.2458 - accuracy: 0.6146 - val_loss: 1.6215 - val_accuracy: 0.5223

Epoch 00015: val_loss did not improve from 1.60473
Epoch 16/25
19234/19234 [==============================] - 17s 877us/step - loss: 1.1928 - accuracy: 0.6320 - val_loss: 1.5519 - val_accuracy: 0.5164

Epoch 00016: val_loss improved from 1.60473 to 1.55195, saving model to Weights/CNN_epoch16_loss1.5519.h5
Epoch 17/25
19234/19234 [==============================] - 17s 876us/step - loss: 1.1664 - accuracy: 0.6472 - val_loss: 1.4620 - val_accuracy: 0.5729

Epoch 00017: val_loss improved from 1.55195 to 1.46201, saving model to Weights/CNN_epoch17_loss1.4620.h5
Epoch 18/25
19234/19234 [==============================] - 17s 876us/step - loss: 1.1018 - accuracy: 0.6638 - val_loss: 1.4290 - val_accuracy: 0.5678

Epoch 00018: val_loss improved from 1.46201 to 1.42901, saving model to Weights/CNN_epoch18_loss1.4290.h5
Epoch 19/25
19234/19234 [==============================] - 17s 877us/step - loss: 1.0558 - accuracy: 0.6766 - val_loss: 1.3820 - val_accuracy: 0.6044

Epoch 00019: val_loss improved from 1.42901 to 1.38196, saving model to Weights/CNN_epoch19_loss1.3820.h5
Epoch 20/25
19234/19234 [==============================] - 17s 875us/step - loss: 1.0351 - accuracy: 0.6843 - val_loss: 1.3551 - val_accuracy: 0.6054

Epoch 00020: val_loss improved from 1.38196 to 1.35512, saving model to Weights/CNN_epoch20_loss1.3551.h5
Epoch 21/25
19234/19234 [==============================] - 17s 875us/step - loss: 1.0051 - accuracy: 0.6906 - val_loss: 1.4105 - val_accuracy: 0.5849

Epoch 00021: val_loss did not improve from 1.35512
Epoch 22/25
19234/19234 [==============================] - 17s 876us/step - loss: 0.9613 - accuracy: 0.7044 - val_loss: 1.3272 - val_accuracy: 0.6106

Epoch 00022: val_loss improved from 1.35512 to 1.32720, saving model to Weights/CNN_epoch22_loss1.3272.h5
Epoch 23/25
19234/19234 [==============================] - 17s 876us/step - loss: 0.9478 - accuracy: 0.7085 - val_loss: 1.3787 - val_accuracy: 0.6023

Epoch 00023: val_loss did not improve from 1.32720
Epoch 24/25
19234/19234 [==============================] - 17s 878us/step - loss: 0.9330 - accuracy: 0.7106 - val_loss: 1.3499 - val_accuracy: 0.6083

Epoch 00024: val_loss did not improve from 1.32720
Epoch 25/25
19234/19234 [==============================] - 17s 875us/step - loss: 0.8950 - accuracy: 0.7209 - val_loss: 1.3872 - val_accuracy: 0.6033

Epoch 00025: val_loss did not improve from 1.32720
Estimator: <keras.engine.training.Model object at 0x7fca7f422f98>
================================================================================
Training accuracy: 65.63%
Testing accuracy: 60.33%
================================================================================
Confusion matrix:
 [[1002    0    1 ...    0    0    0]
 [   0    2    0 ...    0    7    0]
 [   0    0  103 ...    0   12    0]
 ...
 [   0    0    0 ...   22    0    0]
 [  17    2    4 ...    0  545    0]
 [   9    2    0 ...    0   76   96]]
================================================================================
Classification report:
               precision    recall  f1-score   support

           0       0.76      0.85      0.80      1182
           1       0.07      0.04      0.05        46
           2       0.80      0.56      0.66       184
           3       0.25      0.10      0.15        48
           4       0.44      0.79      0.57       503
           5       0.79      0.89      0.84       225
           6       0.97      0.74      0.84       184
           7       0.61      0.24      0.35        58
           8       0.81      0.74      0.77       134
           9       0.95      1.00      0.97        37
          10       0.75      0.84      0.80       135
          11       0.58      0.99      0.73       310
          12       0.86      0.77      0.81       313
          13       0.48      0.52      0.50        58
          14       0.31      0.33      0.32        43
          15       0.23      0.22      0.22        46
          16       0.00      0.00      0.00        37
          17       0.44      0.63      0.52       814
          18       0.83      0.42      0.56       245
          19       0.30      0.35      0.32        91
          20       0.00      0.00      0.00        27
          21       0.43      0.03      0.05       107
          22       0.90      0.92      0.91       151
          23       0.91      0.90      0.90       298
          24       0.00      0.00      0.00       131
          25       0.08      0.25      0.12       129
          26       0.73      0.23      0.35       483
          27       0.37      0.16      0.22       138
          28       0.11      0.04      0.06        24
          29       0.95      0.73      0.82       151
          30       0.48      0.72      0.57        67
          31       0.38      0.28      0.32        60
          32       0.31      0.39      0.34       214
          33       0.50      0.20      0.28        46
          34       0.30      0.18      0.22        34
          35       0.00      0.00      0.00       105
          36       0.44      0.12      0.19       127
          37       0.93      0.46      0.61       196
          38       0.00      0.00      0.00        18
          39       0.00      0.00      0.00        29
          40       0.79      0.21      0.33       104
          41       0.66      0.82      0.73       662
          42       0.95      0.38      0.55       250

    accuracy                           0.60      8244
   macro avg       0.50      0.42      0.43      8244
weighted avg       0.63      0.60      0.58      8244

/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

Recurrent Neural Networks (RNN) --> Gated Recurrent Unit (GRU)

In [198]:
def Build_Model_RNN_Text(word_index, embeddings_matrix, nclasses,dropout=0.5):
    """
    def buildModel_RNN(word_index, embeddings_matrix, nclasses,  MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=100, dropout=0.5):
    word_index in word index ,
    embeddings_matrix is embeddings_matrix, look at data_helper.py
    nClasses is number of classes,
    MAX_SEQUENCE_LENGTH is maximum lenght of text sequences
    """
    model = Sequential()
    hidden_layer = 3
    gru_node = 32
    
    model.add(Embedding(len(word_index) + 1,
                                EMBEDDING_DIM,
                                weights=[embeddings_matrix],
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=True))
    print(gru_node)
    for i in range(0,hidden_layer):
        model.add(GRU(gru_node,return_sequences=True, recurrent_dropout=0.2))
        model.add(Dropout(dropout))
        model.add(BatchNormalization())
    model.add(GRU(gru_node, recurrent_dropout=0.2))
    model.add(Dropout(dropout))
    model.add(BatchNormalization())
    model.add(Dense(256, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(nclasses, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy',
                      optimizer='sgd',
                      metrics=['accuracy'])
    
    print(model.summary())
    return model
In [199]:
# Train the network and run classification
model_RNN = Build_Model_RNN_Text(word_index,embedding_matrix, 43)
run_classification(model_RNN, X_train_Glove, X_test_Glove, y_train, y_test,pipelineRequired = False,isDeepModel=True, arch_name='RNN')
32
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 500, 200)          2844800   
_________________________________________________________________
gru_1 (GRU)                  (None, 500, 32)           22368     
_________________________________________________________________
dropout_10 (Dropout)         (None, 500, 32)           0         
_________________________________________________________________
batch_normalization_7 (Batch (None, 500, 32)           128       
_________________________________________________________________
gru_2 (GRU)                  (None, 500, 32)           6240      
_________________________________________________________________
dropout_11 (Dropout)         (None, 500, 32)           0         
_________________________________________________________________
batch_normalization_8 (Batch (None, 500, 32)           128       
_________________________________________________________________
gru_3 (GRU)                  (None, 500, 32)           6240      
_________________________________________________________________
dropout_12 (Dropout)         (None, 500, 32)           0         
_________________________________________________________________
batch_normalization_9 (Batch (None, 500, 32)           128       
_________________________________________________________________
gru_4 (GRU)                  (None, 32)                6240      
_________________________________________________________________
dropout_13 (Dropout)         (None, 32)                0         
_________________________________________________________________
batch_normalization_10 (Batc (None, 32)                128       
_________________________________________________________________
dense_10 (Dense)             (None, 256)               8448      
_________________________________________________________________
batch_normalization_11 (Batc (None, 256)               1024      
_________________________________________________________________
dense_11 (Dense)             (None, 43)                11051     
=================================================================
Total params: 2,906,923
Trainable params: 2,906,155
Non-trainable params: 768
_________________________________________________________________
None
Train on 19234 samples, validate on 8244 samples
Epoch 1/25
19234/19234 [==============================] - 359s 19ms/step - loss: 3.6944 - accuracy: 0.1434 - val_loss: 3.6297 - val_accuracy: 0.1434

Epoch 00001: val_loss improved from inf to 3.62971, saving model to Weights/RNN_epoch01_loss3.6297.h5
Epoch 2/25
19234/19234 [==============================] - 359s 19ms/step - loss: 3.5731 - accuracy: 0.1434 - val_loss: 3.5186 - val_accuracy: 0.1434

Epoch 00002: val_loss improved from 3.62971 to 3.51857, saving model to Weights/RNN_epoch02_loss3.5186.h5
Epoch 3/25
19234/19234 [==============================] - 357s 19ms/step - loss: 3.4738 - accuracy: 0.1434 - val_loss: 3.4328 - val_accuracy: 0.1434

Epoch 00003: val_loss improved from 3.51857 to 3.43282, saving model to Weights/RNN_epoch03_loss3.4328.h5
Epoch 4/25
19234/19234 [==============================] - 354s 18ms/step - loss: 3.4032 - accuracy: 0.1434 - val_loss: 3.3774 - val_accuracy: 0.1434

Epoch 00004: val_loss improved from 3.43282 to 3.37736, saving model to Weights/RNN_epoch04_loss3.3774.h5
Epoch 5/25
19234/19234 [==============================] - 352s 18ms/step - loss: 3.3598 - accuracy: 0.1434 - val_loss: 3.3446 - val_accuracy: 0.1434

Epoch 00005: val_loss improved from 3.37736 to 3.34460, saving model to Weights/RNN_epoch05_loss3.3446.h5
Epoch 6/25
19234/19234 [==============================] - 348s 18ms/step - loss: 3.3339 - accuracy: 0.1434 - val_loss: 3.3242 - val_accuracy: 0.1434

Epoch 00006: val_loss improved from 3.34460 to 3.32424, saving model to Weights/RNN_epoch06_loss3.3242.h5
Epoch 7/25
19234/19234 [==============================] - 343s 18ms/step - loss: 3.3173 - accuracy: 0.1434 - val_loss: 3.3109 - val_accuracy: 0.1434

Epoch 00007: val_loss improved from 3.32424 to 3.31094, saving model to Weights/RNN_epoch07_loss3.3109.h5
Epoch 8/25
19234/19234 [==============================] - 342s 18ms/step - loss: 3.3061 - accuracy: 0.1434 - val_loss: 3.3014 - val_accuracy: 0.1434

Epoch 00008: val_loss improved from 3.31094 to 3.30139, saving model to Weights/RNN_epoch08_loss3.3014.h5
Epoch 9/25
19234/19234 [==============================] - 342s 18ms/step - loss: 3.2979 - accuracy: 0.1434 - val_loss: 3.2944 - val_accuracy: 0.1434

Epoch 00009: val_loss improved from 3.30139 to 3.29438, saving model to Weights/RNN_epoch09_loss3.2944.h5
Epoch 10/25
19234/19234 [==============================] - 349s 18ms/step - loss: 3.2918 - accuracy: 0.1434 - val_loss: 3.2891 - val_accuracy: 0.1434

Epoch 00010: val_loss improved from 3.29438 to 3.28906, saving model to Weights/RNN_epoch10_loss3.2891.h5
Epoch 11/25
19234/19234 [==============================] - 347s 18ms/step - loss: 3.2871 - accuracy: 0.1434 - val_loss: 3.2849 - val_accuracy: 0.1434

Epoch 00011: val_loss improved from 3.28906 to 3.28495, saving model to Weights/RNN_epoch11_loss3.2849.h5
Epoch 12/25
19234/19234 [==============================] - 348s 18ms/step - loss: 3.2835 - accuracy: 0.1434 - val_loss: 3.2817 - val_accuracy: 0.1434

Epoch 00012: val_loss improved from 3.28495 to 3.28171, saving model to Weights/RNN_epoch12_loss3.2817.h5
Epoch 13/25
19234/19234 [==============================] - 347s 18ms/step - loss: 3.2806 - accuracy: 0.1434 - val_loss: 3.2791 - val_accuracy: 0.1434

Epoch 00013: val_loss improved from 3.28171 to 3.27913, saving model to Weights/RNN_epoch13_loss3.2791.h5
Epoch 14/25
19234/19234 [==============================] - 342s 18ms/step - loss: 3.2783 - accuracy: 0.1434 - val_loss: 3.2770 - val_accuracy: 0.1434

Epoch 00014: val_loss improved from 3.27913 to 3.27704, saving model to Weights/RNN_epoch14_loss3.2770.h5
Epoch 15/25
19234/19234 [==============================] - 342s 18ms/step - loss: 3.2764 - accuracy: 0.1434 - val_loss: 3.2754 - val_accuracy: 0.1434

Epoch 00015: val_loss improved from 3.27704 to 3.27538, saving model to Weights/RNN_epoch15_loss3.2754.h5
Estimator: <keras.engine.sequential.Sequential object at 0x7fca8015e9e8>
================================================================================
Training accuracy: 14.34%
Testing accuracy: 14.34%
================================================================================
Confusion matrix:
 [[1182    0    0 ...    0    0    0]
 [  46    0    0 ...    0    0    0]
 [ 184    0    0 ...    0    0    0]
 ...
 [ 104    0    0 ...    0    0    0]
 [ 662    0    0 ...    0    0    0]
 [ 250    0    0 ...    0    0    0]]
================================================================================
Classification report:
               precision    recall  f1-score   support

           0       0.14      1.00      0.25      1182
           1       0.00      0.00      0.00        46
           2       0.00      0.00      0.00       184
           3       0.00      0.00      0.00        48
           4       0.00      0.00      0.00       503
           5       0.00      0.00      0.00       225
           6       0.00      0.00      0.00       184
           7       0.00      0.00      0.00        58
           8       0.00      0.00      0.00       134
           9       0.00      0.00      0.00        37
          10       0.00      0.00      0.00       135
          11       0.00      0.00      0.00       310
          12       0.00      0.00      0.00       313
          13       0.00      0.00      0.00        58
          14       0.00      0.00      0.00        43
          15       0.00      0.00      0.00        46
          16       0.00      0.00      0.00        37
          17       0.00      0.00      0.00       814
          18       0.00      0.00      0.00       245
          19       0.00      0.00      0.00        91
          20       0.00      0.00      0.00        27
          21       0.00      0.00      0.00       107
          22       0.00      0.00      0.00       151
          23       0.00      0.00      0.00       298
          24       0.00      0.00      0.00       131
          25       0.00      0.00      0.00       129
          26       0.00      0.00      0.00       483
          27       0.00      0.00      0.00       138
          28       0.00      0.00      0.00        24
          29       0.00      0.00      0.00       151
          30       0.00      0.00      0.00        67
          31       0.00      0.00      0.00        60
          32       0.00      0.00      0.00       214
          33       0.00      0.00      0.00        46
          34       0.00      0.00      0.00        34
          35       0.00      0.00      0.00       105
          36       0.00      0.00      0.00       127
          37       0.00      0.00      0.00       196
          38       0.00      0.00      0.00        18
          39       0.00      0.00      0.00        29
          40       0.00      0.00      0.00       104
          41       0.00      0.00      0.00       662
          42       0.00      0.00      0.00       250

    accuracy                           0.14      8244
   macro avg       0.00      0.02      0.01      8244
weighted avg       0.02      0.14      0.04      8244

/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

RNN with LSTM networks

In [200]:
EMBEDDING_DIM = 200
#gloveFileName = 'glove.6B/glove.6B.100d.txt'
gloveFileName = '/content/glove.6B/glove.6B.200d.txt'

from keras.models import Sequential
from keras.layers import Dense, LSTM, TimeDistributed, Activation
from keras.layers import Flatten, Permute, merge, Input
from keras.layers import Embedding
from keras.models import Model
from keras.layers import Input, Dense, multiply, concatenate, Dropout
from keras.layers import GRU, Bidirectional


def Build_Model_LTSM_Text(word_index, embeddings_matrix, nclasses):
    kernel_size = 2
    filters = 256
    pool_size = 2
    gru_node = 256
    
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                                EMBEDDING_DIM,
                                weights=[embeddings_matrix],
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=True))
    model.add(Dropout(0.25))
    model.add(Conv1D(filters, kernel_size, activation='relu'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(Conv1D(filters, kernel_size, activation='relu'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(Conv1D(filters, kernel_size, activation='relu'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(Conv1D(filters, kernel_size, activation='relu'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(Bidirectional(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2)))
    model.add(Bidirectional(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2)))
    model.add(Bidirectional(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2)))
    model.add(Bidirectional(LSTM(gru_node, recurrent_dropout=0.2)))
    model.add(Dense(1024,activation='relu'))
    model.add(Dense(nclasses))
    model.add(Activation('softmax'))
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    
    print(model.summary())
    return model
In [201]:
X_train_Glove,X_test_Glove, word_index,embeddings_index = loadData_Tokenizer(X_train,X_test,gloveFileName)
embedding_matrix = buildEmbed_matrices(word_index,EMBEDDING_DIM)

model_LTSM = Build_Model_LTSM_Text(word_index,embedding_matrix, 43)
run_classification(model_LTSM, X_train_Glove, X_test_Glove, y_train, y_test,pipelineRequired = False,isDeepModel=True, arch_name='LSTM')
Found 14223 unique tokens.
(27478, 500)
Total 400001 word vectors.
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, 500, 200)          2844800   
_________________________________________________________________
dropout_14 (Dropout)         (None, 500, 200)          0         
_________________________________________________________________
conv1d_8 (Conv1D)            (None, 499, 256)          102656    
_________________________________________________________________
max_pooling1d_8 (MaxPooling1 (None, 249, 256)          0         
_________________________________________________________________
conv1d_9 (Conv1D)            (None, 248, 256)          131328    
_________________________________________________________________
max_pooling1d_9 (MaxPooling1 (None, 124, 256)          0         
_________________________________________________________________
conv1d_10 (Conv1D)           (None, 123, 256)          131328    
_________________________________________________________________
max_pooling1d_10 (MaxPooling (None, 61, 256)           0         
_________________________________________________________________
conv1d_11 (Conv1D)           (None, 60, 256)           131328    
_________________________________________________________________
max_pooling1d_11 (MaxPooling (None, 30, 256)           0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 30, 512)           1050624   
_________________________________________________________________
bidirectional_2 (Bidirection (None, 30, 512)           1574912   
_________________________________________________________________
bidirectional_3 (Bidirection (None, 30, 512)           1574912   
_________________________________________________________________
bidirectional_4 (Bidirection (None, 512)               1574912   
_________________________________________________________________
dense_12 (Dense)             (None, 1024)              525312    
_________________________________________________________________
dense_13 (Dense)             (None, 43)                44075     
_________________________________________________________________
activation_1 (Activation)    (None, 43)                0         
=================================================================
Total params: 9,686,187
Trainable params: 9,686,187
Non-trainable params: 0
_________________________________________________________________
None
Train on 19234 samples, validate on 8244 samples
Epoch 1/25
19234/19234 [==============================] - 49s 3ms/step - loss: 3.2713 - accuracy: 0.1412 - val_loss: 3.1682 - val_accuracy: 0.1525

Epoch 00001: val_loss improved from inf to 3.16824, saving model to Weights/LSTM_epoch01_loss3.1682.h5
Epoch 2/25
19234/19234 [==============================] - 45s 2ms/step - loss: 3.0124 - accuracy: 0.2108 - val_loss: 2.8196 - val_accuracy: 0.2642

Epoch 00002: val_loss improved from 3.16824 to 2.81964, saving model to Weights/LSTM_epoch02_loss2.8196.h5
Epoch 3/25
19234/19234 [==============================] - 45s 2ms/step - loss: 2.7087 - accuracy: 0.2840 - val_loss: 2.6226 - val_accuracy: 0.2959

Epoch 00003: val_loss improved from 2.81964 to 2.62258, saving model to Weights/LSTM_epoch03_loss2.6226.h5
Epoch 4/25
19234/19234 [==============================] - 45s 2ms/step - loss: 2.5020 - accuracy: 0.3169 - val_loss: 2.4535 - val_accuracy: 0.3159

Epoch 00004: val_loss improved from 2.62258 to 2.45354, saving model to Weights/LSTM_epoch04_loss2.4535.h5
Epoch 5/25
19234/19234 [==============================] - 45s 2ms/step - loss: 2.3377 - accuracy: 0.3366 - val_loss: 2.2957 - val_accuracy: 0.3322

Epoch 00005: val_loss improved from 2.45354 to 2.29569, saving model to Weights/LSTM_epoch05_loss2.2957.h5
Epoch 6/25
19234/19234 [==============================] - 45s 2ms/step - loss: 2.2205 - accuracy: 0.3606 - val_loss: 2.2392 - val_accuracy: 0.3375

Epoch 00006: val_loss improved from 2.29569 to 2.23923, saving model to Weights/LSTM_epoch06_loss2.2392.h5
Epoch 7/25
19234/19234 [==============================] - 45s 2ms/step - loss: 2.0673 - accuracy: 0.3901 - val_loss: 2.1029 - val_accuracy: 0.3842

Epoch 00007: val_loss improved from 2.23923 to 2.10290, saving model to Weights/LSTM_epoch07_loss2.1029.h5
Epoch 8/25
19234/19234 [==============================] - 45s 2ms/step - loss: 1.9126 - accuracy: 0.4265 - val_loss: 1.9873 - val_accuracy: 0.4045

Epoch 00008: val_loss improved from 2.10290 to 1.98728, saving model to Weights/LSTM_epoch08_loss1.9873.h5
Epoch 9/25
19234/19234 [==============================] - 45s 2ms/step - loss: 1.8038 - accuracy: 0.4555 - val_loss: 1.8176 - val_accuracy: 0.4544

Epoch 00009: val_loss improved from 1.98728 to 1.81762, saving model to Weights/LSTM_epoch09_loss1.8176.h5
Epoch 10/25
19234/19234 [==============================] - 45s 2ms/step - loss: 1.6950 - accuracy: 0.4832 - val_loss: 1.7161 - val_accuracy: 0.4845

Epoch 00010: val_loss improved from 1.81762 to 1.71609, saving model to Weights/LSTM_epoch10_loss1.7161.h5
Epoch 11/25
19234/19234 [==============================] - 45s 2ms/step - loss: 1.6000 - accuracy: 0.5135 - val_loss: 1.6424 - val_accuracy: 0.5017

Epoch 00011: val_loss improved from 1.71609 to 1.64239, saving model to Weights/LSTM_epoch11_loss1.6424.h5
Epoch 12/25
19234/19234 [==============================] - 45s 2ms/step - loss: 1.4934 - accuracy: 0.5453 - val_loss: 1.5523 - val_accuracy: 0.5411

Epoch 00012: val_loss improved from 1.64239 to 1.55234, saving model to Weights/LSTM_epoch12_loss1.5523.h5
Epoch 13/25
19234/19234 [==============================] - 45s 2ms/step - loss: 1.4114 - accuracy: 0.5665 - val_loss: 1.5261 - val_accuracy: 0.5462

Epoch 00013: val_loss improved from 1.55234 to 1.52611, saving model to Weights/LSTM_epoch13_loss1.5261.h5
Epoch 14/25
19234/19234 [==============================] - 44s 2ms/step - loss: 1.3433 - accuracy: 0.5883 - val_loss: 1.4105 - val_accuracy: 0.5825

Epoch 00014: val_loss improved from 1.52611 to 1.41049, saving model to Weights/LSTM_epoch14_loss1.4105.h5
Epoch 15/25
19234/19234 [==============================] - 45s 2ms/step - loss: 1.2690 - accuracy: 0.6045 - val_loss: 1.4453 - val_accuracy: 0.5638

Epoch 00015: val_loss did not improve from 1.41049
Epoch 16/25
19234/19234 [==============================] - 44s 2ms/step - loss: 1.1927 - accuracy: 0.6256 - val_loss: 1.3429 - val_accuracy: 0.6003

Epoch 00016: val_loss improved from 1.41049 to 1.34286, saving model to Weights/LSTM_epoch16_loss1.3429.h5
Epoch 17/25
19234/19234 [==============================] - 45s 2ms/step - loss: 1.1445 - accuracy: 0.6436 - val_loss: 1.2581 - val_accuracy: 0.6195

Epoch 00017: val_loss improved from 1.34286 to 1.25811, saving model to Weights/LSTM_epoch17_loss1.2581.h5
Epoch 18/25
19234/19234 [==============================] - 44s 2ms/step - loss: 1.0772 - accuracy: 0.6607 - val_loss: 1.1981 - val_accuracy: 0.6297

Epoch 00018: val_loss improved from 1.25811 to 1.19811, saving model to Weights/LSTM_epoch18_loss1.1981.h5
Epoch 19/25
19234/19234 [==============================] - 44s 2ms/step - loss: 1.0741 - accuracy: 0.6626 - val_loss: 1.2772 - val_accuracy: 0.6213

Epoch 00019: val_loss did not improve from 1.19811
Epoch 20/25
19234/19234 [==============================] - 45s 2ms/step - loss: 1.0036 - accuracy: 0.6804 - val_loss: 1.1916 - val_accuracy: 0.6429

Epoch 00020: val_loss improved from 1.19811 to 1.19155, saving model to Weights/LSTM_epoch20_loss1.1916.h5
Epoch 21/25
19234/19234 [==============================] - 44s 2ms/step - loss: 0.9491 - accuracy: 0.6974 - val_loss: 1.1141 - val_accuracy: 0.6596

Epoch 00021: val_loss improved from 1.19155 to 1.11408, saving model to Weights/LSTM_epoch21_loss1.1141.h5
Epoch 22/25
19234/19234 [==============================] - 44s 2ms/step - loss: 0.9277 - accuracy: 0.6983 - val_loss: 1.0944 - val_accuracy: 0.6627

Epoch 00022: val_loss improved from 1.11408 to 1.09436, saving model to Weights/LSTM_epoch22_loss1.0944.h5
Epoch 23/25
19234/19234 [==============================] - 44s 2ms/step - loss: 0.8787 - accuracy: 0.7124 - val_loss: 1.0802 - val_accuracy: 0.6665

Epoch 00023: val_loss improved from 1.09436 to 1.08019, saving model to Weights/LSTM_epoch23_loss1.0802.h5
Epoch 24/25
19234/19234 [==============================] - 44s 2ms/step - loss: 0.8474 - accuracy: 0.7222 - val_loss: 1.0245 - val_accuracy: 0.6872

Epoch 00024: val_loss improved from 1.08019 to 1.02446, saving model to Weights/LSTM_epoch24_loss1.0245.h5
Epoch 25/25
19234/19234 [==============================] - 44s 2ms/step - loss: 0.8541 - accuracy: 0.7194 - val_loss: 1.0368 - val_accuracy: 0.6824

Epoch 00025: val_loss did not improve from 1.02446
Estimator: <keras.engine.sequential.Sequential object at 0x7fca6171ca58>
================================================================================
Training accuracy: 76.05%
Testing accuracy: 68.24%
================================================================================
Confusion matrix:
 [[876   0   2 ...  15   0   5]
 [  0  24   0 ...   0   3   4]
 [  0   0 138 ...   0   1  11]
 ...
 [  5   0   0 ...  87   0   0]
 [ 17   3   4 ...   0 433 111]
 [ 14   0   0 ...   0   1 159]]
================================================================================
Classification report:
               precision    recall  f1-score   support

           0       0.84      0.74      0.79      1182
           1       0.89      0.52      0.66        46
           2       0.77      0.75      0.76       184
           3       0.62      0.75      0.68        48
           4       0.74      0.66      0.70       503
           5       0.93      0.87      0.90       225
           6       0.89      0.84      0.86       184
           7       0.85      0.95      0.89        58
           8       0.74      0.92      0.82       134
           9       0.77      1.00      0.87        37
          10       0.87      0.87      0.87       135
          11       0.91      0.95      0.93       310
          12       0.73      0.86      0.79       313
          13       0.83      0.76      0.79        58
          14       0.90      0.81      0.85        43
          15       0.81      0.65      0.72        46
          16       0.69      0.78      0.73        37
          17       0.52      0.65      0.57       814
          18       0.85      0.65      0.74       245
          19       0.81      0.68      0.74        91
          20       0.40      0.70      0.51        27
          21       0.63      0.49      0.55       107
          22       0.92      0.87      0.89       151
          23       0.83      0.93      0.88       298
          24       0.50      0.05      0.08       131
          25       0.41      0.47      0.44       129
          26       0.34      0.61      0.44       483
          27       0.40      0.55      0.46       138
          28       0.86      0.75      0.80        24
          29       0.90      0.85      0.87       151
          30       0.79      0.88      0.83        67
          31       0.77      0.95      0.85        60
          32       0.57      0.21      0.31       214
          33       1.00      0.52      0.69        46
          34       0.83      0.29      0.43        34
          35       0.29      0.08      0.12       105
          36       0.36      0.16      0.22       127
          37       0.74      0.48      0.58       196
          38       1.00      0.39      0.56        18
          39       0.64      0.55      0.59        29
          40       0.81      0.84      0.82       104
          41       0.87      0.65      0.75       662
          42       0.38      0.64      0.48       250

    accuracy                           0.68      8244
   macro avg       0.72      0.66      0.67      8244
weighted avg       0.71      0.68      0.68      8244